Introduction

Our principal goal

Our main goal is to evaluate different models and choose the best among them to determine whether new applicants represent a good or bad credit risk.

In this context, we decided to use the methodology named “Cross-Industry Standard Process for Data Mining” (CRISP-DM). This model consists of six phases that naturally describe the data science life cycle. Below, you will find a picture that describes this process.

Figure 1: CRISP-DM Process

CRISP-DM Process

The data

For this project, we will use the dataset “GermanCredit.csv” provided in the course Projects in Data Analytics for Decision Making, taught by Professor Jacques Zuber. It contains 1’000 past credit applicants, described by 30 variables.

# Load the applicants dataset (semicolon-separated file, "." as decimal mark)
data <- read.csv2(here::here("data/GermanCredit.csv"), dec = ".", header = TRUE)

Our questions

In order to make a better analysis, we asked ourselves some questions that we will try to answer through the EDA and the applied models. We will come back to these questions in the conclusions section:

  1. Are there any variables that could be grouped?
  2. Have we used all the original independent variables of the model?
  3. Is the data balanced with respect to the response variable?
  4. Does it make sense to balance the data to avoid a biased model?
  5. Accuracy, sensitivity or specificity: which one should we focus on most?
  6. Which model fits best?

Business understanding

The goal of this phase is to understand the project objectives and what is needed to achieve them, then translate this into a data mining problem definition, which includes a step-by-step plan for analysis and application.

Determine business objectives

In general, the banking business has two main goals:

  • The first, to offer a variety of products and services that answer individual and business customers’ needs;
  • The second, and the most important for this project, to collect payments on the products provided to clients, with the goal of generating income for shareholders.

Both objectives require balancing customers’ needs against the gains the company requires for its operations. However, in this analysis we will focus on diminishing the risk linked to the second goal. In other words, we are looking for a good model to forecast which clients have a higher risk of not being able to pay back a credit granted to them.

Our main goal will be to minimize the losses given by the sum of the credit amounts granted to applicants who are predicted to be positive (hence eligible for a credit) but who are actually negative, as they represent the risk of the amount received never being paid back.

We want the losses to be smaller than 10% of the total amount of credit granted to the customers.

Goal: Losses < 10% amount of credit

We will determine this by assuming that the company grants a credit only to applicants with a good credit score, and rejects the others.

Figure 2: Goal of a credit

Goal of a credit
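The 10% target above can be sketched in R. The function below is a hypothetical illustration (the names and toy vectors are ours, standing in for model predictions and observed outcomes), not part of the project code:

```r
# Loss from false positives: applicants predicted "good" (1) whose true label is "bad" (0)
loss_from_false_positives <- function(actual, predicted, amount) {
  fp <- predicted == 1 & actual == 0   # granted but should have been rejected
  granted <- predicted == 1            # all credits that would be granted
  loss <- sum(amount[fp])
  total_granted <- sum(amount[granted])
  c(loss = loss, total = total_granted, ratio = loss / total_granted)
}

# Toy example: two of the three granted credits default
res <- loss_from_false_positives(actual = c(0, 1, 0, 1),
                                 predicted = c(1, 1, 1, 0),
                                 amount = c(1000, 2000, 500, 3000))
res[["ratio"]] <= 0.10   # here the 10% target would not be met
```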

Assess situation

Now we will state our assumptions and list the requirements and constraints of the project.

Assess List
Assumptions 1) The team members have all the required skills.
2) The data is real.
Requirements 1) Boundaries of the work: identify the best model among at least 5.
2) Submit a report with our findings.
Constraints 1) We have about 7 weeks to complete the analysis.
2) The size of the data set.
3) Limited input variables.

Determine data mining goals

The proposed mining goal for this analysis is to obtain a model with a good evaluation in terms of the risk of non-repayment, given the information of a new applicant, thus helping to decide whether it is a good idea to grant them the credit they are requesting. This algorithm needs to have high accuracy, with particular attention to the negative impact of a false positive.

  1. Classification: group variables that bring similar information, potentially creating dummy variables and eliminating variables that do not bring enough information.
  2. Prediction: find the model that gives us the best predictions and compare it against the testing data.
  3. Optimization: maximize the sensitivity of the model.

Produce project plan

In order to meet the objectives, we made a Gantt chart; the time span considered was divided into 6 weeks, from November 2 to December 18.

Figure 3: Gantt Project

Gantt Project

We have followed the deadlines in order to obtain the corresponding feedback for each week.

Data Understanding

Going forward with the analysis, the goal of this second step is to get a first perception of the information brought by the data and to create hypotheses about it. To do so, we will develop the following points:

Collect initial data

The dataset was delivered together with the description of the task and is in csv format. It contains 1’000 observations, one per row, with 30 input variables and 1 output variable. In addition to the dataset, we looked for more information in YouTube videos, mainly to familiarize ourselves with the credit-granting business itself and thus dig deeper into the variables we consider most informative.

Describe data

At this point, we will examine the gross properties of the acquired data. Let’s start by checking its structure and size. As you can see below, there are 1’000 rows and 32 variables. The first column identifies each observation with a unique ID (a number), the 30 following columns are the input variables, and the last one is the output variable, which indicates whether the person is a high risk (credit rejected) or not (credit accepted).

dim(data)
## [1] 1000   32

Now, let’s take a look at the summary and structure of the data, including statistical characteristics such as the minimum, mean and maximum.

Overview of the dataset (1000 observations):

Summary

Output variable: yes / no

As we can see from the graph, the majority of the observations have a positive value (700 against 300).
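This split can be verified directly in R (assuming the response column is named RESPONSE and coded 0/1, as in the descriptive statistics by group):

```r
# Distribution of the response variable (1 = good risk, 0 = bad risk)
counts <- table(data$RESPONSE)
counts               # counts per class: 300 bad vs. 700 good
prop.table(counts)   # proportions: 30% vs. 70%
```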

Description by output variable

## 
##  Descriptive statistics by group 
## group: 0
##                  vars   n    mean      sd median trimmed     mad min   max
## OBS.                1 300  515.76  281.03  542.0  518.66  349.15   2   999
## CHK_ACCT            2 300    0.90    1.05    1.0    0.75    1.48   0     3
## DURATION            3 300   24.86   13.28   24.0   23.59   17.79   6    72
## HISTORY             4 300    2.17    1.08    2.0    2.19    0.00   0     4
## NEW_CAR             5 300    0.30    0.46    0.0    0.25    0.00   0     1
## USED_CAR            6 300    0.06    0.23    0.0    0.00    0.00   0     1
## FURNITURE           7 300    0.19    0.40    0.0    0.12    0.00   0     1
## RADIO.TV            8 300    0.21    0.41    0.0    0.13    0.00   0     1
## EDUCATION           9 300    0.07    0.26    0.0    0.00    0.00   0     1
## RETRAINING         10 300    0.11    0.32    0.0    0.02    0.00   0     1
## AMOUNT             11 300 3938.13 3535.82 2574.5 3291.18 2092.69 433 18424
## SAV_ACCT           12 300    0.67    1.30    0.0    0.34    0.00   0     4
## EMPLOYMENT         13 300    2.17    1.22    2.0    2.18    1.48   0     4
## INSTALL_RATE       14 300    3.10    1.09    4.0    3.25    0.00   1     4
## MALE_DIV           15 300    0.07    0.25    0.0    0.00    0.00   0     1
## MALE_SINGLE        16 300    0.49    0.50    0.0    0.48    0.00   0     1
## MALE_MAR_or_WID    17 300    0.08    0.28    0.0    0.00    0.00   0     1
## CO.APPLICANT       18 300    0.06    0.24    0.0    0.00    0.00   0     1
## GUARANTOR          19 300    0.03    0.18    0.0    0.00    0.00   0     1
## PRESENT_RESIDENT   20 300    2.85    1.09    3.0    2.94    1.48   1     4
## REAL_ESTATE        21 300    0.20    0.40    0.0    0.12    0.00   0     1
## PROP_UNKN_NONE     22 300    0.22    0.42    0.0    0.15    0.00   0     1
## AGE                23 300   33.96   11.22   31.0   32.38    8.90  19    74
## OTHER_INSTALL      24 300    0.25    0.44    0.0    0.19    0.00   0     1
## RENT               25 300    0.23    0.42    0.0    0.17    0.00   0     1
## OWN_RES            26 300    0.62    0.49    1.0    0.65    0.00   0     1
## NUM_CREDITS        27 300    1.37    0.56    1.0    1.29    0.00   1     4
## JOB                28 300    1.94    0.67    2.0    1.95    0.00   0     3
## NUM_DEPENDENTS     29 300    1.15    0.36    1.0    1.07    0.00   1     2
## TELEPHONE          30 300    0.38    0.49    0.0    0.35    0.00   0     1
## FOREIGN            31 300    0.01    0.11    0.0    0.00    0.00   0     1
## RESPONSE           32 300    0.00    0.00    0.0    0.00    0.00   0     0
##                  range  skew kurtosis     se
## OBS.               997 -0.08    -1.13  16.23
## CHK_ACCT             3  0.99    -0.27   0.06
## DURATION            66  0.83     0.03   0.77
## HISTORY              4  0.07    -0.09   0.06
## NEW_CAR              1  0.89    -1.22   0.03
## USED_CAR             1  3.82    12.60   0.01
## FURNITURE            1  1.55     0.39   0.02
## RADIO.TV             1  1.44     0.08   0.02
## EDUCATION            1  3.26     8.64   0.02
## RETRAINING           1  2.43     3.91   0.02
## AMOUNT           17991  1.57     2.05 204.14
## SAV_ACCT             4  1.83     1.82   0.08
## EMPLOYMENT           4  0.12    -0.96   0.07
## INSTALL_RATE         3 -0.72    -0.97   0.06
## MALE_DIV             1  3.46     9.98   0.01
## MALE_SINGLE          1  0.05    -2.00   0.03
## MALE_MAR_or_WID      1  3.00     7.02   0.02
## CO.APPLICANT         1  3.69    11.63   0.01
## GUARANTOR            1  5.17    24.85   0.01
## PRESENT_RESIDENT     3 -0.25    -1.40   0.06
## REAL_ESTATE          1  1.49     0.23   0.02
## PROP_UNKN_NONE       1  1.32    -0.25   0.02
## AGE                 55  1.14     0.73   0.65
## OTHER_INSTALL        1  1.13    -0.73   0.03
## RENT                 1  1.25    -0.43   0.02
## OWN_RES              1 -0.49    -1.76   0.03
## NUM_CREDITS          3  1.45     2.34   0.03
## JOB                  3 -0.40     0.44   0.04
## NUM_DEPENDENTS       1  1.91     1.67   0.02
## TELEPHONE            1  0.51    -1.75   0.03
## FOREIGN              1  8.44    69.53   0.01
## RESPONSE             0   NaN      NaN   0.00
## ------------------------------------------------------------ 
## group: 1
##                  vars   n    mean      sd median trimmed     mad min   max
## OBS.                1 700  493.96  292.05  482.5  492.61  377.32   1  1000
## CHK_ACCT            2 700    1.87    1.23    2.0    1.96    1.48   0     3
## DURATION            3 700   19.21   11.08   18.0   17.88    8.90   4    60
## HISTORY             4 700    2.71    1.04    2.0    2.72    0.00   0     4
## NEW_CAR             5 700    0.21    0.41    0.0    0.13    0.00   0     1
## USED_CAR            6 700    0.12    0.33    0.0    0.03    0.00   0     1
## FURNITURE           7 700    0.18    0.38    0.0    0.09    0.00   0     1
## RADIO.TV            8 700    0.31    0.46    0.0    0.26    0.00   0     1
## EDUCATION           9 700    0.04    0.20    0.0    0.00    0.00  -1     1
## RETRAINING         10 700    0.09    0.29    0.0    0.00    0.00   0     1
## AMOUNT             11 700 2985.46 2401.47 2244.0 2564.20 1485.57 250 15857
## SAV_ACCT           12 700    1.29    1.65    0.0    1.11    0.00   0     4
## EMPLOYMENT         13 700    2.48    1.19    2.0    2.54    1.48   0     4
## INSTALL_RATE       14 700    2.92    1.13    3.0    3.02    1.48   1     4
## MALE_DIV           15 700    0.04    0.20    0.0    0.00    0.00   0     1
## MALE_SINGLE        16 700    0.57    0.49    1.0    0.59    0.00   0     1
## MALE_MAR_or_WID    17 700    0.10    0.29    0.0    0.00    0.00   0     1
## CO.APPLICANT       18 700    0.03    0.18    0.0    0.00    0.00   0     1
## GUARANTOR          19 700    0.06    0.25    0.0    0.00    0.00   0     2
## PRESENT_RESIDENT   20 700    2.84    1.11    3.0    2.93    1.48   1     4
## REAL_ESTATE        21 700    0.32    0.47    0.0    0.27    0.00   0     1
## PROP_UNKN_NONE     22 700    0.12    0.33    0.0    0.03    0.00   0     1
## AGE                23 700   36.30   11.77   34.0   34.92   10.38  19   125
## OTHER_INSTALL      24 700    0.16    0.36    0.0    0.07    0.00   0     1
## RENT               25 700    0.16    0.36    0.0    0.07    0.00   0     1
## OWN_RES            26 700    0.75    0.43    1.0    0.82    0.00   0     1
## NUM_CREDITS        27 700    1.42    0.58    1.0    1.35    0.00   1     4
## JOB                28 700    1.89    0.65    2.0    1.89    0.00   0     3
## NUM_DEPENDENTS     29 700    1.16    0.36    1.0    1.07    0.00   1     2
## TELEPHONE          30 700    0.42    0.49    0.0    0.39    0.00   0     1
## FOREIGN            31 700    0.05    0.21    0.0    0.00    0.00   0     1
## RESPONSE           32 700    1.00    0.00    1.0    1.00    0.00   1     1
##                  range  skew kurtosis    se
## OBS.               999  0.04    -1.23 11.04
## CHK_ACCT             3 -0.39    -1.53  0.05
## DURATION            56  1.18     1.38  0.42
## HISTORY              4  0.00    -0.90  0.04
## NEW_CAR              1  1.44     0.08  0.02
## USED_CAR             1  2.29     3.26  0.01
## FURNITURE            1  1.70     0.89  0.01
## RADIO.TV             1  0.81    -1.34  0.02
## EDUCATION            2  4.31    20.27  0.01
## RETRAINING           1  2.86     6.18  0.01
## AMOUNT           15607  1.94     4.62 90.77
## SAV_ACCT             4  0.76    -1.16  0.06
## EMPLOYMENT           4 -0.22    -0.87  0.04
## INSTALL_RATE         3 -0.45    -1.29  0.04
## MALE_DIV             1  4.50    18.32  0.01
## MALE_SINGLE          1 -0.30    -1.91  0.02
## MALE_MAR_or_WID      1  2.74     5.53  0.01
## CO.APPLICANT         1  5.23    25.39  0.01
## GUARANTOR            2  3.93    14.88  0.01
## PRESENT_RESIDENT     3 -0.28    -1.38  0.04
## REAL_ESTATE          1  0.78    -1.39  0.02
## PROP_UNKN_NONE       1  2.27     3.17  0.01
## AGE                106  1.43     4.51  0.45
## OTHER_INSTALL        1  1.88     1.54  0.01
## RENT                 1  1.89     1.59  0.01
## OWN_RES              1 -1.17    -0.63  0.02
## NUM_CREDITS          3  1.20     1.30  0.02
## JOB                  3 -0.37     0.50  0.02
## NUM_DEPENDENTS       1  1.89     1.59  0.01
## TELEPHONE            1  0.34    -1.89  0.02
## FOREIGN              1  4.26    16.21  0.01
## RESPONSE             0   NaN      NaN  0.00

As we see in the table above, all the values are integers and there are no missing values. However, some variables are inconsistent with the initial description, as we explain in the following table.

Variable Description Inconsistencies
CHK_ACCT C: 0, 1, 2, 3 X
DURATION Numerical X
HISTORY C: 0, 1, 2, 3, 4 X
NEW_CAR B: 0, 1 X
USED_CAR B: 0, 1 X
FURNITURE B: 0, 1 X
RADIO.TV B: 0, 1 X
EDUCATION B: 0, 1 ✔: According to the description this should be a binary variable, but the data show a -1
RETRAINING B: 0, 1 X
AMOUNT Numerical X
SAV_ACCT C: 0, 1, 2, 3, 4 X
EMPLOYMENT C: 0, 1, 2, 3, 4 X
INSTALL_RATE Numerical X
MALE_DIV B: 0, 1 X
MALE_SINGLE B: 0, 1 X
MALE_MAR_WID B: 0, 1 X
CO-APPLICANT B: 0, 1 X
GUARANTOR B: 0, 1 ✔: According to the description this should be a binary variable, but the data show a 2
PRESENT_RESIDENT C: 0, 1, 2, 3 ✔: According to the description we should have 3 categories instead of the 4 shown by the data
REAL_ESTATE B: 0, 1 X
PROP_UNKN_NONE B: 0, 1 X
AGE Numerical ✔: Outliers identified; the age should not go up to 125 years
OTHER_INSTALL B: 0, 1 X
RENT B: 0, 1 X
OWN_RES B: 0, 1 X
NUM_CREDITS Numerical X
JOB C: 0, 1, 2, 3 X
NUM_DEPENDENTS Numerical X
TELEPHONE B: 0, 1 X
FOREIGN B: 0, 1 X

Then, for the four inconsistencies found above, we have established the following hypotheses and solutions.

  • EDUCATION: there was an error in recording the information; the -1 must be replaced by 1.
  • GUARANTOR: there was an error in recording the information; the 2 must be replaced by 1.
  • PRESENT_RESIDENT: again, there was an error in recording the information; the value was entered in years instead of as a category.
  • AGE: here it is clear that we have some outliers; we will limit the age to 75.

The corrections will be made in the next section.
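A minimal sketch of these corrections, assuming the column names shown in the summary output (the actual cleaning is performed in the next section):

```r
# Hypothetical sketch of the planned fixes (applied for real in the next section)
data_clean <- data
data_clean$EDUCATION[data_clean$EDUCATION == -1] <- 1  # -1 assumed to be a typo for 1
data_clean$GUARANTOR[data_clean$GUARANTOR == 2] <- 1   # 2 assumed to be a typo for 1
data_clean$AGE <- pmin(data_clean$AGE, 75)             # cap age outliers at 75
# PRESENT_RESIDENT would need a mapping from years back to categories (not shown)
```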

It is also important to mention that the output variable shows the credit accepted in 70% of the cases and rejected in 30%, an imbalance which could later bias the prediction.

Explore data

In this section, we will go deeper into the data and look for patterns or relationships between variables. To do so, we will draw a histogram of each variable to check its distribution.

Histogram

All variables
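As an illustration, histograms for a few of the numerical variables can be drawn in base R; the variable selection below is ours, chosen for illustration only:

```r
# Histograms of a few numerical inputs, side by side
num_vars <- c("DURATION", "AMOUNT", "AGE")
par(mfrow = c(1, length(num_vars)))
for (v in num_vars) hist(data[[v]], main = v, xlab = v, col = "grey")
par(mfrow = c(1, 1))
```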

Of independent variables grouped by response


Regarding the last charts, we have the following observations:

  • We can see that the numerical variables have right-skewed (non-normal) distributions, possibly because of the significant lower bound.
  • It is very difficult to identify patterns in the data.
  • In the second chart, histograms by response variable, we can see the proportion of positive and negative applications.

Boxplot

Now that we have checked the distribution of the variables, let’s evaluate their quartiles.

All variables

Of independent variables grouped by response
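A grouped boxplot of this kind can be produced with base R; AMOUNT is our illustrative choice of variable:

```r
# AMOUNT by response group; outliers drawn as red points
boxplot(AMOUNT ~ RESPONSE, data = data,
        names = c("bad (0)", "good (1)"),
        main = "AMOUNT by credit risk", outcol = "red")
```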

In the following table, you will find our principal observations.

Boxplot Observation
For each variable We identify that some variables could be mutually exclusive. We can evaluate forming the following groups:
1) Aggregation of the variables describing the purpose of the credit.
2) Aggregation of the male variables into one categorical variable.
3) Aggregation of REAL_ESTATE and PROP_UNKN_NONE into one categorical variable.
4) Aggregation of RENT and OWN_RES into one categorical variable.
Plot by response The variables which stand out most are CHK_ACCT and EMPLOYMENT; the boxes differ clearly between the two groups.

Additionally, we identified some outliers, shown as red dots. As the second chart shows, clustering the input variables by the output variable reveals that some features bring more information than others.

Key definitions:

  1. Real property includes the physical property (land, structures and resources attached to it) of the real estate, but the definition extends to other types of ownership rights. This means an applicant can hold property without real estate (PROP_UNKN_NONE = 0 & REAL_ESTATE = 0).
  2. For the RENT and OWN_RES variables, we see 3 cases: 1-0, 0-1 and 0-0, because the loan applicant may neither own their residence nor pay any rent.

Now, let’s take a look at the correlation between the variables.

Correlation

The variables which are more correlated are the following:

  • HISTORY and NUM_CREDITS, positively correlated;
  • DURATION and AMOUNT, positively correlated;
  • the response variable and CHK_ACCT, positively correlated.
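These pairs can be inspected with a correlation matrix restricted to the variables of interest; a minimal sketch, assuming the column names from the summary output:

```r
# Pairwise correlations among the variables mentioned above
cm <- cor(data[, c("HISTORY", "NUM_CREDITS", "DURATION", "AMOUNT",
                   "CHK_ACCT", "RESPONSE")])
round(cm, 2)
```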

In the model section we will evaluate the coefficients of the variables and continue this analysis in greater depth.

Verify data quality

To verify data quality, we pose 3 questions that we will address during this step:

  1. Is the data complete (does it cover all the cases required)?
  2. Is it correct or does it contain any error?
  3. Are there missing values in the data? If so how are they represented?

Overall dataset

By variable

## [1] 0
## [1] 1

## # A tibble: 1 x 4
##   type      cnt  pcnt col_name    
##   <chr>   <int> <dbl> <named list>
## 1 integer    31   100 <chr [31]>

Below you will find the answers to the 3 questions:

Question Answer
1 Yes, all the columns and rows contain information.
2 No, the data contain some errors, found during data description and exploration:
1) The variable EDUCATION shows a value that is not binary.
2) The variable PRESENT_RESIDENT has more categories than mentioned in the description.
3) The variable AGE is out of range.
4) AMOUNT, DURATION and AGE are the variables with the most outliers.
3 No, there are no missing values in the data.

In addition, we are going to analyze whether the aggregations mentioned in the boxplot section are possible. To do so, we will apply the Chi-Square test to measure the independence between variables. Next, we explain the steps taken for each analysis:

1) Variables: REAL_ESTATE and PROP_UNKN_NONE

First, we establish the hypotheses:

H0: REAL_ESTATE and PROP_UNKN_NONE are independent variables.

against the bilateral alternative:

H1: They are not independent.

For the chi-squared test to be valid, the following conditions must be true:

  1. The sampling method is random
  2. The variables considered are categorical
  3. Size: all levels have more than 5 expected events.

Assumptions: significance level of 0.05.

Clarifications: the p-value is the probability that a chi-square statistic with the given degrees of freedom is more extreme than \(X^2\).

Finally, we will reject or fail to reject the null hypothesis by checking the p-value: if the p-value is less than the significance level established earlier, we reject the null hypothesis and conclude that there is a relationship between the variables.
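In R this procedure corresponds to chisq.test() on the contingency table of the two variables; a sketch (the actual analysis is carried out in the next chapter):

```r
# Chi-square test of independence between REAL_ESTATE and PROP_UNKN_NONE
tab <- table(data$REAL_ESTATE, data$PROP_UNKN_NONE)
test <- chisq.test(tab)
test$expected        # validity check: all expected counts should exceed 5
test$p.value < 0.05  # TRUE would lead us to reject H0 (the variables are related)
```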

2) RENT and OWN_RES

First, we establish the hypotheses:

H0: RENT and OWN_RES are independent variables.

against the bilateral alternative:

H1: They are not independent.

Then we move on with the same process that we mentioned above.

The analysis can be found in the next chapter.

Data Understanding

Going forward with the anaylisis, the goal in the second step is to have a first perception of the information brought by the data and create hypotheses about them. To be able to do so, we will develop the following points:

Collect initial data

The dataset was delivered together with the description of the task and is in csv format. It contains 1’000 observations, 1 each row, 30 input variables and 1 output variable. In addition to the dataset, we seek for more information in videos in youtube where we were mainly intended to familiarize with the operation by itself and, hence, indeep more into the variables that we consider that bring more information.

Describe data

In this point, we will examine the gross properties of the adquire data. Let’s start checking the stucture and size of it. As you can see below, there are 1’000 rows and 32 variables. The first column identifies all the observations taken into consideration with an unique ID (a number), the 30 following columns are the input variables, while the last one is the output variable, which gives the information regarding the person is a high risk (the credit is rejected) or not (the credit is accepted).

dim(data)
## [1] 1000   32

Now, let’s give a look to the summary and the structe of the data, including their statistical characteristics, for example, minimum, mean, maximum, so on.

Overview of the dataset (1000 observations):

Summary

Output variable: yes / no

As we can see from the graph, there is the majority of the observations which have a positive value (700 against 300).

Description by output variable

## 
##  Descriptive statistics by group 
## group: 0
##                  vars   n    mean      sd median trimmed     mad min   max
## OBS.                1 300  515.76  281.03  542.0  518.66  349.15   2   999
## CHK_ACCT            2 300    0.90    1.05    1.0    0.75    1.48   0     3
## DURATION            3 300   24.86   13.28   24.0   23.59   17.79   6    72
## HISTORY             4 300    2.17    1.08    2.0    2.19    0.00   0     4
## NEW_CAR             5 300    0.30    0.46    0.0    0.25    0.00   0     1
## USED_CAR            6 300    0.06    0.23    0.0    0.00    0.00   0     1
## FURNITURE           7 300    0.19    0.40    0.0    0.12    0.00   0     1
## RADIO.TV            8 300    0.21    0.41    0.0    0.13    0.00   0     1
## EDUCATION           9 300    0.07    0.26    0.0    0.00    0.00   0     1
## RETRAINING         10 300    0.11    0.32    0.0    0.02    0.00   0     1
## AMOUNT             11 300 3938.13 3535.82 2574.5 3291.18 2092.69 433 18424
## SAV_ACCT           12 300    0.67    1.30    0.0    0.34    0.00   0     4
## EMPLOYMENT         13 300    2.17    1.22    2.0    2.18    1.48   0     4
## INSTALL_RATE       14 300    3.10    1.09    4.0    3.25    0.00   1     4
## MALE_DIV           15 300    0.07    0.25    0.0    0.00    0.00   0     1
## MALE_SINGLE        16 300    0.49    0.50    0.0    0.48    0.00   0     1
## MALE_MAR_or_WID    17 300    0.08    0.28    0.0    0.00    0.00   0     1
## CO.APPLICANT       18 300    0.06    0.24    0.0    0.00    0.00   0     1
## GUARANTOR          19 300    0.03    0.18    0.0    0.00    0.00   0     1
## PRESENT_RESIDENT   20 300    2.85    1.09    3.0    2.94    1.48   1     4
## REAL_ESTATE        21 300    0.20    0.40    0.0    0.12    0.00   0     1
## PROP_UNKN_NONE     22 300    0.22    0.42    0.0    0.15    0.00   0     1
## AGE                23 300   33.96   11.22   31.0   32.38    8.90  19    74
## OTHER_INSTALL      24 300    0.25    0.44    0.0    0.19    0.00   0     1
## RENT               25 300    0.23    0.42    0.0    0.17    0.00   0     1
## OWN_RES            26 300    0.62    0.49    1.0    0.65    0.00   0     1
## NUM_CREDITS        27 300    1.37    0.56    1.0    1.29    0.00   1     4
## JOB                28 300    1.94    0.67    2.0    1.95    0.00   0     3
## NUM_DEPENDENTS     29 300    1.15    0.36    1.0    1.07    0.00   1     2
## TELEPHONE          30 300    0.38    0.49    0.0    0.35    0.00   0     1
## FOREIGN            31 300    0.01    0.11    0.0    0.00    0.00   0     1
## RESPONSE           32 300    0.00    0.00    0.0    0.00    0.00   0     0
##                  range  skew kurtosis     se
## OBS.               997 -0.08    -1.13  16.23
## CHK_ACCT             3  0.99    -0.27   0.06
## DURATION            66  0.83     0.03   0.77
## HISTORY              4  0.07    -0.09   0.06
## NEW_CAR              1  0.89    -1.22   0.03
## USED_CAR             1  3.82    12.60   0.01
## FURNITURE            1  1.55     0.39   0.02
## RADIO.TV             1  1.44     0.08   0.02
## EDUCATION            1  3.26     8.64   0.02
## RETRAINING           1  2.43     3.91   0.02
## AMOUNT           17991  1.57     2.05 204.14
## SAV_ACCT             4  1.83     1.82   0.08
## EMPLOYMENT           4  0.12    -0.96   0.07
## INSTALL_RATE         3 -0.72    -0.97   0.06
## MALE_DIV             1  3.46     9.98   0.01
## MALE_SINGLE          1  0.05    -2.00   0.03
## MALE_MAR_or_WID      1  3.00     7.02   0.02
## CO.APPLICANT         1  3.69    11.63   0.01
## GUARANTOR            1  5.17    24.85   0.01
## PRESENT_RESIDENT     3 -0.25    -1.40   0.06
## REAL_ESTATE          1  1.49     0.23   0.02
## PROP_UNKN_NONE       1  1.32    -0.25   0.02
## AGE                 55  1.14     0.73   0.65
## OTHER_INSTALL        1  1.13    -0.73   0.03
## RENT                 1  1.25    -0.43   0.02
## OWN_RES              1 -0.49    -1.76   0.03
## NUM_CREDITS          3  1.45     2.34   0.03
## JOB                  3 -0.40     0.44   0.04
## NUM_DEPENDENTS       1  1.91     1.67   0.02
## TELEPHONE            1  0.51    -1.75   0.03
## FOREIGN              1  8.44    69.53   0.01
## RESPONSE             0   NaN      NaN   0.00
## ------------------------------------------------------------ 
## group: 1
##                  vars   n    mean      sd median trimmed     mad min   max
## OBS.                1 700  493.96  292.05  482.5  492.61  377.32   1  1000
## CHK_ACCT            2 700    1.87    1.23    2.0    1.96    1.48   0     3
## DURATION            3 700   19.21   11.08   18.0   17.88    8.90   4    60
## HISTORY             4 700    2.71    1.04    2.0    2.72    0.00   0     4
## NEW_CAR             5 700    0.21    0.41    0.0    0.13    0.00   0     1
## USED_CAR            6 700    0.12    0.33    0.0    0.03    0.00   0     1
## FURNITURE           7 700    0.18    0.38    0.0    0.09    0.00   0     1
## RADIO.TV            8 700    0.31    0.46    0.0    0.26    0.00   0     1
## EDUCATION           9 700    0.04    0.20    0.0    0.00    0.00  -1     1
## RETRAINING         10 700    0.09    0.29    0.0    0.00    0.00   0     1
## AMOUNT             11 700 2985.46 2401.47 2244.0 2564.20 1485.57 250 15857
## SAV_ACCT           12 700    1.29    1.65    0.0    1.11    0.00   0     4
## EMPLOYMENT         13 700    2.48    1.19    2.0    2.54    1.48   0     4
## INSTALL_RATE       14 700    2.92    1.13    3.0    3.02    1.48   1     4
## MALE_DIV           15 700    0.04    0.20    0.0    0.00    0.00   0     1
## MALE_SINGLE        16 700    0.57    0.49    1.0    0.59    0.00   0     1
## MALE_MAR_or_WID    17 700    0.10    0.29    0.0    0.00    0.00   0     1
## CO.APPLICANT       18 700    0.03    0.18    0.0    0.00    0.00   0     1
## GUARANTOR          19 700    0.06    0.25    0.0    0.00    0.00   0     2
## PRESENT_RESIDENT   20 700    2.84    1.11    3.0    2.93    1.48   1     4
## REAL_ESTATE        21 700    0.32    0.47    0.0    0.27    0.00   0     1
## PROP_UNKN_NONE     22 700    0.12    0.33    0.0    0.03    0.00   0     1
## AGE                23 700   36.30   11.77   34.0   34.92   10.38  19   125
## OTHER_INSTALL      24 700    0.16    0.36    0.0    0.07    0.00   0     1
## RENT               25 700    0.16    0.36    0.0    0.07    0.00   0     1
## OWN_RES            26 700    0.75    0.43    1.0    0.82    0.00   0     1
## NUM_CREDITS        27 700    1.42    0.58    1.0    1.35    0.00   1     4
## JOB                28 700    1.89    0.65    2.0    1.89    0.00   0     3
## NUM_DEPENDENTS     29 700    1.16    0.36    1.0    1.07    0.00   1     2
## TELEPHONE          30 700    0.42    0.49    0.0    0.39    0.00   0     1
## FOREIGN            31 700    0.05    0.21    0.0    0.00    0.00   0     1
## RESPONSE           32 700    1.00    0.00    1.0    1.00    0.00   1     1
##                  range  skew kurtosis    se
## OBS.               999  0.04    -1.23 11.04
## CHK_ACCT             3 -0.39    -1.53  0.05
## DURATION            56  1.18     1.38  0.42
## HISTORY              4  0.00    -0.90  0.04
## NEW_CAR              1  1.44     0.08  0.02
## USED_CAR             1  2.29     3.26  0.01
## FURNITURE            1  1.70     0.89  0.01
## RADIO.TV             1  0.81    -1.34  0.02
## EDUCATION            2  4.31    20.27  0.01
## RETRAINING           1  2.86     6.18  0.01
## AMOUNT           15607  1.94     4.62 90.77
## SAV_ACCT             4  0.76    -1.16  0.06
## EMPLOYMENT           4 -0.22    -0.87  0.04
## INSTALL_RATE         3 -0.45    -1.29  0.04
## MALE_DIV             1  4.50    18.32  0.01
## MALE_SINGLE          1 -0.30    -1.91  0.02
## MALE_MAR_or_WID      1  2.74     5.53  0.01
## CO.APPLICANT         1  5.23    25.39  0.01
## GUARANTOR            2  3.93    14.88  0.01
## PRESENT_RESIDENT     3 -0.28    -1.38  0.04
## REAL_ESTATE          1  0.78    -1.39  0.02
## PROP_UNKN_NONE       1  2.27     3.17  0.01
## AGE                106  1.43     4.51  0.45
## OTHER_INSTALL        1  1.88     1.54  0.01
## RENT                 1  1.89     1.59  0.01
## OWN_RES              1 -1.17    -0.63  0.02
## NUM_CREDITS          3  1.20     1.30  0.02
## JOB                  3 -0.37     0.50  0.02
## NUM_DEPENDENTS       1  1.89     1.59  0.01
## TELEPHONE            1  0.34    -1.89  0.02
## FOREIGN              1  4.26    16.21  0.01
## RESPONSE             0   NaN      NaN  0.00

As shown in the table above, all the values are integers and there are no missing values. However, we can see some inconsistencies between the variables and their initial description, which we explain in the following table.

Variable Description Inconsistencies
CHK_ACCT C: 0, 1, 2, 3 X
DURATION Numerical X
HISTORY C: 0, 1, 2, 3, 4 X
NEW_CAR B: 0, 1 X
USED_CAR B: 0, 1 X
FURNITURE B: 0, 1 X
RADIO.TV B: 0, 1 X
EDUCATION B: 0, 1 ✔: According to the description this should be a binary variable, but the data shows a -1
RETRAINING B: 0, 1 X
AMOUNT Numerical X
SAV_ACCT C: 0, 1, 2, 3, 4 X
EMPLOYMENT C: 0, 1, 2, 3, 4 X
INSTALL_RATE Numerical X
MALE_DIV B: 0, 1 X
MALE_SINGLE B: 0, 1 X
MALE_MAR_WID B: 0, 1 X
CO-APPLICANT B: 0, 1 X
GUARANTOR B: 0, 1 ✔: According to the description this should be a binary variable, but the data shows a 2
PRESENT_RESIDENT C: 0, 1, 2, 3 ✔: According to the description we should have 3 categories instead of the 4 shown by the data
REAL_ESTATE B: 0, 1 X
PROP_UNKN_NONE B: 0, 1 X
AGE Numerical ✔: Outliers identified; the age should not go up to 125 years
OTHER_INSTALL B: 0, 1 X
RENT B: 0, 1 X
OWN_RES B: 0, 1 X
NUM_CREDITS Numerical X
JOB C: 0, 1, 2, 3 X
NUM_DEPENDENTS Numerical X
TELEPHONE B: 0, 1 X
FOREIGN B: 0, 1 X

Then, for the four inconsistencies found above, we have established the following hypotheses and solutions.

  • EDUCATION: there was an error in the registration of the information and the -1 must be replaced by 1.
  • GUARANTOR: there was an error in the registration of the information and the 2 must be replaced by 1.
  • PRESENT_RESIDENT: Again, there was an error in the registration of the information and the register was made in years instead of the category.
  • AGE: here it is clear that we have some outliers; we will limit the age to 75.

The corrections will be made in the next section.

It is also important to mention that the response variable shows that 70% of the credits were accepted and 30% rejected, which could later bias the predictions.
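This balance can be checked with a one-liner; a minimal sketch on a hypothetical data frame reproducing the reported 70/30 proportions (the counts are illustrative, only the variable name RESPONSE matches the dataset):

```r
# Sketch: checking the class balance of the response variable
# (hypothetical data frame reproducing the reported 70/30 proportions)
df <- data.frame(RESPONSE = c(rep(1, 700), rep(0, 300)))
prop.table(table(df$RESPONSE))
##   0   1 
## 0.3 0.7
```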

Explore data

In this section, we will go deeper into the data and look for patterns or relationships between variables. To do so, we will plot a histogram of each variable to check the distribution of our data.

Histogram

All variables

Of independent variables grouped by response


Regarding the charts above, we have the following observations:

  • The numerical variables are right-skewed (asymmetric to the right), possibly because we have a more significant lower bound.
  • It is very difficult to identify patterns in the data.
  • In the second set of charts, histograms grouped by the response variable, we can see the proportion of positive and negative applications within each variable.

Boxplot

Now that we have checked the distribution of the variables, let’s move on to the evaluation of their quartiles.

All variables

Of independent variables grouped by response

In the following table, you will find our principal observations.

Boxplot Observation
For each variable We identify that some variables could be mutually exclusive. We can evaluate the formation of the following groups:
1) Aggregation of the variables describing the purpose of the credit
2) Aggregation of the male variables into a single categorical one.
3) Aggregation of REAL_ESTATE and PROP_UNKN_NONE into a single categorical one.
4) Aggregation of RENT and OWN_RES into a single categorical one.
plot by response The variables which stand out the most are CHK_ACCT and EMPLOYMENT: the boxes differ clearly between the two response groups.

Additionally, we identified some outliers, shown as red dots. As you can see in the second chart, grouping the input variables by the output variable shows that some features carry more information than others.

key definitions:

  1. Real property includes the physical property (physical land, structures and resources attached to it) of the real estate, but the definition expands to include other types of ownership rights. This means a person can have property rights without real estate (PROP_UNKN_NONE = 0 & REAL_ESTATE = 0).
  2. For the RENT and OWN_RES variables, we see 3 cases: 1-0, 0-1 and 0-0, because the person applying for the loan may neither own their residence nor pay any rent.

Now, we will take a look at the correlation between variables.

Correlation

The variables which are more correlated are the following:

  • HISTORY and NUM_CREDITS, positively correlated with each other.
  • DURATION and AMOUNT, positively correlated.
  • The response variable and CHK_ACCT, positively correlated.

In the model section we will evaluate the coefficients of the variables and continue this analysis in greater depth.

Verify data quality

To do so, we establish 3 questions that we will address during this step:

  1. Is the data complete (does it cover all the required cases)?
  2. Is it correct, or does it contain errors?
  3. Are there missing values in the data? If so, how are they represented?

Overall dataset

By variable

## [1] 0
## [1] 1

## # A tibble: 1 x 4
##   type      cnt  pcnt col_name    
##   <chr>   <int> <dbl> <named list>
## 1 integer    31   100 <chr [31]>

Below you will find the answers to the 3 questions:

Question Answer
1 Yes, all the columns and rows contain information.
2 No, it contains some errors, which were found during the data description and exploration and are the following:
1) The variable EDUCATION shows a value that is not binary.
2) The variable PRESENT_RESIDENT has more categories than those mentioned in the description.
3) The variable AGE is out of range.
4) AMOUNT, DURATION and AGE are the variables with the highest number of outliers.
3 No, there are no missing values in the data.

In addition, we are going to analyse whether the aggregations mentioned in the boxplot section are possible. To do so, we will apply the Chi-Square test to measure the independence between the variables. Next, we explain the steps taken for each analysis:

1) Variables: REAL_ESTATE and PROP_UNKN_NONE

First, we establish the hypotheses:

H0: REAL_ESTATE and PROP_UNKN_NONE are independent variables.

against the bilateral alternative:

H1: They are not independent.

For the chi-squared test to be valid, the following conditions must be true:

  1. The sampling method is random
  2. The variables considered are categorical
  3. Size: all levels have more than 5 expected events.

Assumptions: significance level of 0.05. Clarification: the p-value is the probability that a chi-square statistic with X degrees of freedom is more extreme than \(X^2\).

Finally, we will accept or reject the hypothesis by checking the p-value: if the p-value is less than the significance level established earlier, we reject the null hypothesis and conclude that there is a relationship between the variables.
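A minimal sketch of this decision rule in R, on simulated binary variables (not the credit data), where x and y are constructed to be dependent:

```r
# Sketch of the chi-squared decision rule on simulated binary variables
set.seed(42)
alpha <- 0.05
x <- rbinom(300, 1, 0.5)
y <- ifelse(runif(300) < 0.8, x, 1 - x)   # y agrees with x ~80% of the time
test <- chisq.test(table(x, y))
test$p.value < alpha   # TRUE: reject H0, the variables are not independent
```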

2) RENT and OWN_RES

First, we establish the hypotheses:

H0: RENT and OWN_RES are independent variables.

against the bilateral alternative:

H1: They are not independent.

Then we move on with the same process that we mentioned above.

The analysis can be found in the next chapter.

Exploratory data analysis

Clean data

Task Output
Raise the data quality to the level required by the selected analysis techniques. This may involve selection of clean subsets of the data, the insertion of suitable defaults or more ambitious techniques such as the estimation of missing data by modeling. Describe what decisions and actions were taken to address the data quality problems reported during the verify data quality task of the data understanding phase. Transformations of the data for cleaning purposes and the possible impact on the analysis results should be considered.

Reconsider how to deal with observed type of noise

We will consider how to correct the inconsistencies we have found in the previous chapter, which are on four different variables, namely:

  • EDUCATION: there was an error in the registration of the information and the -1 must be replaced by 1.
  • GUARANTOR: there was an error in the registration of the information and the 2 must be replaced by 1.
  • PRESENT_RESIDENT: Again, there was an error in the registration of the information and the register was made in years instead of the category.
  • AGE: here it is clear that we have some outliers; we will limit the age to 75.

We have already decided how to correct them, hence we will move on to that direction.

Correct, remove or ignore noise

We will start by correcting the noise in EDUCATION, GUARANTOR and PRESENT_RESIDENT: for the first two, we simply replace the values -1 and 2 with 1; for the latter, we decrease each category number by 1, which yields the true values corresponding to the data description that you can find in the appendix.

#EDUCATION
data %<>% 
  mutate(EDUCATION = replace(EDUCATION, EDUCATION == -1, 1))

#GUARANTOR
data %<>% 
  mutate(GUARANTOR = replace(GUARANTOR, GUARANTOR == 2, 1))

#PRESENT_RESIDENT 
data %<>% 
  mutate(PRESENT_RESIDENT = PRESENT_RESIDENT - 1)

Decide how to deal with special values and their meaning

This applies specifically to AGE. As previously said, we believe that the age of 125 is an error, hence we will discard it by keeping only the observations with a value lower than 76 (as 75 is the second highest value).

#AGE
data %<>% 
  filter(AGE < 76)

Construct data

Task Output
This task includes constructive data preparation operations such as the production of derived attributes, entire new records or transformed values for existing attributes. Derived attributes are new attributes that are constructed from one or more existing attributes in the same record. Examples:
area = length * width. Describe the creation of completely new records.
Example: create records for customers who made no purchase during the past year. There was no reason to have such records in the raw data, but for modeling purposes it might make sense to explicitly represent the fact that certain customers made zero purchases.

Check available construction mechanisms

As already mentioned in the previous chapter, we will create 4 different variables:

  1. A binary variable describing the sex of the person (male vs. female)
  2. A categorical variable for the purpose of the credit
  3. A categorical variable describing the property situation of the person (i.e. whether they own real estate)
  4. A categorical variable describing the residence situation of the person (i.e. whether they own their residence, are renting or something else)

Sex variable

We will start with the variable describing the sex of the considered person: it will be created from the MALE_DIV, MALE_SINGLE and MALE_MAR_or_WID variables, and it will be a binary variable taking value 1 if the person is male and 0 if they are female.

More specifically, if any of the variables used to construct the new one has value 1, so will the SEX_MALE variable; otherwise it will have value 0.

data %<>% 
  mutate(SEX_MALE = ifelse((MALE_DIV | MALE_SINGLE | MALE_MAR_or_WID) == 1, 1, 0)) %>% 
  mutate(SEX_MALE = as.factor(SEX_MALE))

We will now explore the new variable we have created, by looking at the number of instances in each category and how the response variable is distributed across them.

#Representation of SEX_MALE per value
data %>% 
  ggplot(aes(SEX_MALE)) + 
  geom_bar(aes(fill = factor(SEX_MALE))) + 
  theme(legend.position = "none")  + 
  geom_label(stat = 'count', aes(label =..count..)) 

#Representation of output variable in terms of SEX_MALE
data %>% 
  ggplot(aes(RESPONSE)) + 
  geom_bar(aes(fill = factor(SEX_MALE)), position = "dodge")+ 
  labs(color = "", fill = "SEX_MALE", x = "RESPONSE", y = "count") 

We can see from the first graph that we have more observations with a positive value for the SEX_MALE variable (690 vs. 309), meaning that there are more men than women in the dataset.

Moreover, thanks to the second graph, we can see a difference in positive responses between males and females, but this could also be due to the higher presence of males with respect to females.

Collapse variables

We will now move on to the other variables mentioned above, so that instead of having multiple dummy variables, we have factor variables with multiple levels.

Purpose

Let’s start with the purpose of credit.

This variable will take the following values:

  • 1 = the purpose of the credit was a new car
  • 2 = the purpose of the credit was a used car
  • 3 = the purpose of the credit was furniture
  • 4 = the purpose of the credit was a radio or a television
  • 5 = the purpose of the credit was to increase education
  • 6 = the purpose of the credit was retraining
  • 0 = the purpose of the credit was something else

It will be created by assigning the respective value whenever the dummy corresponding to one purpose takes value 1; if none of them has value 1, then PURPOSE will take value 0.

data %<>% 
  mutate(PURPOSE = ifelse(NEW_CAR == 1, 1, 
                          ifelse(USED_CAR == 1, 2, 
                                 ifelse(FURNITURE == 1, 3, 
                                        ifelse(RADIO.TV == 1, 4, 
                                               ifelse(EDUCATION == 1, 5, 
                                                      ifelse(RETRAINING == 1, 6, 0))))))) %>% 
  mutate(PURPOSE = as.factor(PURPOSE))

Let’s have a look at the new variable, in terms of number of observation per level and its link to the response variable.

data %>% 
  ggplot(aes(PURPOSE)) + 
  geom_bar(aes(reorder(PURPOSE, -table(PURPOSE)[PURPOSE]), fill = PURPOSE)) +
  scale_fill_discrete(name = "PURPOSE", 
                      labels = c("OTHER", "NEW_CAR", "USED_CAR", "FURNITURE", 
                                 "RADIO/TV", "EDUCATION", "RETRAINING")) + 
  geom_label(stat = 'count', aes(label =..count..)) 

data %>% 
  ggplot(aes(RESPONSE)) + 
  geom_bar(aes(fill = factor(PURPOSE)), position = "dodge") + 
  labs(x = "RESPONSE", y = "count") +  
  scale_fill_discrete(name = "PURPOSE", 
                      labels = c("OTHER", "NEW_CAR", "USED_CAR", "FURNITURE", 
                                 "RADIO/TV", "EDUCATION", "RETRAINING")) + 
  theme_bw()

In the first graph, we can see that the majority of the observations fall under the purpose of getting a radio or a TV, followed by a new car and then furniture, while the least represented is the education purpose.

In terms of the output variable, shown in the second graph, the largest differences are found for Radio/TV and new car, but this could be explained by the fact that these are the purposes with the highest number of observations.

Property

Now we will create the property variable.

We will start by checking whether the two variables that we want to use (namely, REAL_ESTATE and PROP_UNKN_NONE) are related, and hence whether it makes sense to put them together.

In order to do so, as we previously mentioned, we will perform a chi-squared independence test.

chisq.test(data$REAL_ESTATE, data$PROP_UNKN_NONE)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  data$REAL_ESTATE and data$PROP_UNKN_NONE
## X-squared = 69.97, df = 1, p-value < 2.2e-16

We can see that the two variables are statistically significantly associated, as the p-value is really low, almost equal to 0, and hence is lower than the considered significance level of alpha = 5%.

We can conclude that it makes sense to merge the two variables into one factor variable, which will take value 1 if the person has a real estate, value 2 if the person is not known to have a property and value 0 otherwise.

data %<>% 
  mutate(PROPERTY = as.factor(ifelse(REAL_ESTATE == 1, 1, 
                                     ifelse(PROP_UNKN_NONE == 1, 2, 0))))

Let’s have a look also at this new variable, once again in terms of number of observations per level and if there is a difference of occurences given the output variables.

data %>% 
  ggplot(aes(PROPERTY)) + 
  geom_bar(aes(fill = PROPERTY)) + 
  scale_fill_discrete(name = "PROPERTY", 
                      labels = c("OTHER", "REAL_ESTATE", "PROP_UNKN_NONE")) + 
  geom_label(stat = 'count', aes(label = ..count..))

data %>% 
  ggplot(aes(RESPONSE)) + 
  geom_bar(aes(fill = PROPERTY), position = "dodge") + 
  scale_fill_discrete(name = "PROPERTY", 
                      labels = c("OTHER", "REAL_ESTATE", "PROP_UNKN_NONE"))

We can clearly see in the first graph that the majority of the observations do not have a clear value for the property, being equal to 0 (563, compared to 282 for REAL_ESTATE and 154 for PROP_UNKN_NONE).

If we consider the response, hence the second graph, the property groups look similar among the rejected credits, while among the accepted credits people owning real estate appear more frequently than those without a known property.

Residence

Now let’s look at the third variable that we wish to create.

This variable will be created using the RENT and OWN_RES variables to describe whether a person has a residence or not.

Let’s start once again by the chi-squared independence test.

chisq.test(data$RENT, data$OWN_RES)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  data$RENT and data$OWN_RES
## X-squared = 536.77, df = 1, p-value < 2.2e-16

Also here, we can conclude that the two variables are statistically significantly associated, as the p-value is really low, especially as it is lower than the significance level we have chosen of alpha being equal to 5%.

Hence, we will create the residence variable, which will take value 1 if the person is renting, value 2 if the person is owning their own residence and value 0 otherwise.

data %<>% 
  mutate(RESIDENCE = as.factor(ifelse(RENT == 1, 1, 
                                      ifelse(OWN_RES == 1, 2, 0))))

And let’s explore the new variable a little bit, in terms of number of observations per level and if there is a difference in the possibility to get a credit given this variable.

data %>% 
  ggplot(aes(RESIDENCE)) + 
  geom_bar(aes(fill = RESIDENCE)) + 
  scale_fill_discrete(name = "RESIDENCE", 
                      labels = c("OTHER", "RENT", "OWN_RES")) + 
  geom_label(stat = 'count', aes(label = ..count..))

data %>% 
  ggplot(aes(RESPONSE)) + 
  geom_bar(aes(fill = RESIDENCE), position = "dodge") + 
  scale_fill_discrete(name = "RESIDENCE", 
                      labels = c("OTHER", "RENT", "OWN_RES"))

In the first graph, we can see that the majority of the people in the sample do own their own residence (712 observations, compared to the 108 of other and 179 who are renting).

Looking at the second graph, comparing it to the response variable, we can see that owning the residence seems to have an impact on the possibility to get the credit, while renting seems not to have a major impact.

Integrate data

Task Output
These are methods whereby information is combined from multiple tables or records to create new records or values. Merging tables refers to joining together two or more tables that have different information about the same objects.
Merged data also covers aggregations. Aggregation refers to operations where new values are computed by summarizing together information from multiple records and/or tables.

Selecting variables we created and discard others

Here we integrate the variables we created in the dataset and we discard the ones we used to create them, so that we avoid the problem of multicollinearity.

We will also need to drop one of the variables used to create the SEX_MALE variable, to avoid multicollinearity problems. The choice falls on MALE_DIV.

We will also drop the identifier variable (OBS.) as it is not needed in the modelling part.

data_sel <- data %>%
                dplyr::select(CHK_ACCT, DURATION, HISTORY, PURPOSE,
                       AMOUNT, SAV_ACCT, EMPLOYMENT, INSTALL_RATE,
                       SEX_MALE, MALE_SINGLE, MALE_MAR_or_WID,
                       CO.APPLICANT, GUARANTOR, PRESENT_RESIDENT,
                       PROPERTY, AGE, OTHER_INSTALL, RESIDENCE,
                       NUM_CREDITS, JOB, TELEPHONE, RESPONSE) 

Select data

Task Output
Decide on the data to be used for analysis. Criteria include relevance to the data mining goals, quality and technical constraints such as limits on data volume or data types. Note that data selection covers selection of attributes (columns) as well as selection of records (rows) in a table. List the data to be included/excluded and the reasons for these decisions.

To further select the data, we will use the correlation and run a simple linear model to see which variables are the most important to select.

We start with the correlation, using the basic dataset, because we cannot run a correlation on factor variables.
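The ranking below can be reproduced with a pattern along these lines, sketched here on a small hypothetical data frame (not the credit dataset, since the report's own chunk is hidden):

```r
# Sketch: correlation of each numeric predictor with the response,
# sorted by absolute value (hypothetical data)
set.seed(1)
df <- data.frame(RESPONSE = rbinom(50, 1, 0.7),
                 A = rnorm(50),
                 B = rnorm(50))
cors <- cor(df[, -1], df$RESPONSE)
cors[order(abs(cors)), , drop = FALSE]   # smallest |correlation| first
```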

##                            V1
## PRESENT_RESIDENT -0.003059919
## NUM_DEPENDENTS    0.003296525
## MALE_MAR_or_WID   0.019844152
## FURNITURE        -0.020669253
## JOB              -0.033889427
## TELEPHONE         0.035704280
## RETRAINING       -0.035923923
## NUM_CREDITS       0.046215841
## MALE_DIV         -0.049924304
## GUARANTOR         0.055206089
## CO.APPLICANT     -0.062607640
## EDUCATION        -0.069954175
## INSTALL_RATE     -0.073052339
## MALE_SINGLE       0.081465268
## AGE               0.089413005
## RENT             -0.092509400
## NEW_CAR          -0.098268291
## USED_CAR          0.100040026
## RADIO.TV          0.107374760
## OTHER_INSTALL    -0.113009082
## EMPLOYMENT        0.117550263
## REAL_ESTATE       0.119759431
## PROP_UNKN_NONE   -0.125508812
## OWN_RES           0.134228850
## AMOUNT           -0.154366015
## SAV_ACCT          0.178079352
## DURATION         -0.214326399
## HISTORY           0.229192869
## CHK_ACCT          0.352022485

We can see that, in general, the correlation between the output variable and the explanatory variables is not particularly high, with a maximum of 0.35 for CHK_ACCT and a minimum of -0.00306 for PRESENT_RESIDENT, in absolute terms.

We could decide to select only the variables with a correlation above a certain absolute value; however, as the differences among the correlations are not very large, we prefer not to make a selection here, and rather leave this decision to the modelling of a simple linear regression and a choice based on the AIC.

The Akaike information criterion (AIC) is a mathematical method for evaluating how well a model fits the data it was generated from. In statistics, AIC is used to compare different possible models and determine which one is the best fit for the data. source: https://www.scribbr.com/statistics/akaike-information-criterion/
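Formally, for a model with \(k\) estimated parameters and maximized likelihood \(\hat{L}\), the criterion is

\[AIC = 2k - 2\ln(\hat{L})\]

so a lower AIC indicates a better trade-off between goodness of fit and model complexity.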

The step function iteratively discards the variable whose presence increases the AIC of the model the most, up to the point at which it is no longer possible to decrease the AIC.

Perform significance and correlation tests to decide what to include

set.seed(2143)
lm.sel <- glm(RESPONSE ~., data = data_sel)
lm.sel <- step(lm.sel, trace = 0)
summary(lm.sel) 
## 
## Call:
## glm(formula = RESPONSE ~ CHK_ACCT + DURATION + HISTORY + PURPOSE + 
##     AMOUNT + SAV_ACCT + EMPLOYMENT + INSTALL_RATE + MALE_SINGLE + 
##     GUARANTOR + PROPERTY + OTHER_INSTALL + RESIDENCE + NUM_CREDITS + 
##     TELEPHONE, data = data_sel)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.05164  -0.31768   0.08993   0.28791   0.83553  
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.944e-01  1.056e-01   6.576 7.88e-11 ***
## CHK_ACCT       9.306e-02  1.078e-02   8.632  < 2e-16 ***
## DURATION      -5.070e-03  1.453e-03  -3.489 0.000506 ***
## HISTORY        6.796e-02  1.363e-02   4.984 7.35e-07 ***
## PURPOSE1      -1.213e-01  6.016e-02  -2.017 0.043987 *  
## PURPOSE2       1.018e-01  6.854e-02   1.485 0.137931    
## PURPOSE3      -1.316e-02  6.218e-02  -0.212 0.832417    
## PURPOSE4       2.282e-03  5.959e-02   0.038 0.969465    
## PURPOSE5      -1.558e-01  7.902e-02  -1.972 0.048920 *  
## PURPOSE6      -1.599e-02  6.837e-02  -0.234 0.815149    
## AMOUNT        -1.719e-05  6.730e-06  -2.554 0.010801 *  
## SAV_ACCT       3.303e-02  8.344e-03   3.959 8.08e-05 ***
## EMPLOYMENT     2.125e-02  1.106e-02   1.920 0.055087 .  
## INSTALL_RATE  -4.665e-02  1.279e-02  -3.648 0.000278 ***
## MALE_SINGLE    7.415e-02  2.757e-02   2.690 0.007267 ** 
## GUARANTOR      1.724e-01  5.859e-02   2.943 0.003330 ** 
## PROPERTY1      4.181e-02  3.058e-02   1.367 0.171816    
## PROPERTY2     -9.679e-02  5.763e-02  -1.680 0.093348 .  
## OTHER_INSTALL -8.740e-02  3.315e-02  -2.637 0.008506 ** 
## RESIDENCE1    -1.280e-01  6.941e-02  -1.844 0.065457 .  
## RESIDENCE2    -5.565e-02  6.651e-02  -0.837 0.402923    
## NUM_CREDITS   -4.231e-02  2.489e-02  -1.700 0.089473 .  
## TELEPHONE      4.960e-02  2.734e-02   1.814 0.069942 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1577717)
## 
##     Null deviance: 209.91  on 998  degrees of freedom
## Residual deviance: 153.99  on 976  degrees of freedom
## AIC: 1015
## 
## Number of Fisher Scoring iterations: 2
data_sel <- lm.sel$model

Thanks to the AIC, we select the following variables: CHK_ACCT, DURATION, HISTORY, PURPOSE, AMOUNT, SAV_ACCT, EMPLOYMENT, INSTALL_RATE, MALE_SINGLE, GUARANTOR, PROPERTY, OTHER_INSTALL, RESIDENCE, NUM_CREDITS and TELEPHONE, as they are the most significant. It is interesting to note that some levels of PURPOSE seem less relevant; more specifically, the only statistically significant ones are the first and the fifth. Moreover, the selection is coherent with the variables that had the highest correlations calculated before, hence we will use this method to make our final selection on the data.

Format data

Task Output
Formatting transformations refer to primarily syntactic modifications made to the data that do not change its meaning, but might be required by the modeling tool. Some tools have requirements on the order of the attributes, such as the first field being a unique identifier for each record or the last field being the outcome field the model is to predict. It might be important to change the order of the records in the dataset. Perhaps the modeling tool requires that the records be sorted according to the value of the outcome attribute. Additionally, there are purely syntactic changes made to satisfy the requirements of the specific modeling tool.

We will change the variables to factors for the dummies and the categorical variables, to have them corresponding to the description that has been given to us.

data_sel %<>% 
  mutate(
    CHK_ACCT = as.factor(CHK_ACCT),
    HISTORY = as.factor(HISTORY),
    SAV_ACCT = as.factor(SAV_ACCT),
    EMPLOYMENT = as.factor(EMPLOYMENT),
    MALE_SINGLE = as.factor(MALE_SINGLE), 
    GUARANTOR = as.factor(GUARANTOR),
    OTHER_INSTALL = as.factor(OTHER_INSTALL),
    TELEPHONE = as.factor(TELEPHONE),
    RESPONSE = as.factor(RESPONSE)
  )


str(data_sel)
## 'data.frame':    999 obs. of  16 variables:
##  $ RESPONSE     : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 1 ...
##  $ CHK_ACCT     : Factor w/ 4 levels "0","1","2","3": 1 2 4 1 1 4 4 2 4 2 ...
##  $ DURATION     : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ HISTORY      : Factor w/ 5 levels "0","1","2","3",..: 5 3 5 3 4 3 3 3 3 5 ...
##  $ PURPOSE      : Factor w/ 7 levels "0","1","2","3",..: 5 5 6 4 2 6 4 3 5 2 ...
##  $ AMOUNT       : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ SAV_ACCT     : Factor w/ 5 levels "0","1","2","3",..: 5 1 1 1 1 5 3 1 4 1 ...
##  $ EMPLOYMENT   : Factor w/ 5 levels "0","1","2","3",..: 5 3 4 4 3 3 5 3 4 1 ...
##  $ INSTALL_RATE : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ MALE_SINGLE  : Factor w/ 2 levels "0","1": 2 1 2 2 2 2 2 2 1 1 ...
##  $ GUARANTOR    : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 1 1 1 1 ...
##  $ PROPERTY     : Factor w/ 3 levels "0","1","2": 2 2 2 1 3 3 1 1 2 1 ...
##  $ OTHER_INSTALL: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ RESIDENCE    : Factor w/ 3 levels "0","1","2": 3 3 3 1 1 1 3 2 3 3 ...
##  $ NUM_CREDITS  : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ TELEPHONE    : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 2 1 1 ...
##  - attr(*, "terms")=Classes 'terms', 'formula'  language RESPONSE ~ CHK_ACCT + DURATION + HISTORY + PURPOSE + AMOUNT + SAV_ACCT +      EMPLOYMENT + INSTALL_RATE + MALE_SI| __truncated__ ...
##   .. ..- attr(*, "variables")= language list(RESPONSE, CHK_ACCT, DURATION, HISTORY, PURPOSE, AMOUNT, SAV_ACCT,      EMPLOYMENT, INSTALL_RATE, MALE_SINGLE| __truncated__ ...
##   .. ..- attr(*, "factors")= int [1:16, 1:15] 0 1 0 0 0 0 0 0 0 0 ...
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:16] "RESPONSE" "CHK_ACCT" "DURATION" "HISTORY" ...
##   .. .. .. ..$ : chr [1:15] "CHK_ACCT" "DURATION" "HISTORY" "PURPOSE" ...
##   .. ..- attr(*, "term.labels")= chr [1:15] "CHK_ACCT" "DURATION" "HISTORY" "PURPOSE" ...
##   .. ..- attr(*, "order")= int [1:15] 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(RESPONSE, CHK_ACCT, DURATION, HISTORY, PURPOSE, AMOUNT, SAV_ACCT,      EMPLOYMENT, INSTALL_RATE, MALE_SINGLE| __truncated__ ...
##   .. ..- attr(*, "dataClasses")= Named chr [1:16] "numeric" "numeric" "numeric" "numeric" ...
##   .. .. ..- attr(*, "names")= chr [1:16] "RESPONSE" "CHK_ACCT" "DURATION" "HISTORY" ...

The selected dataset hence has 999 observations of 16 variables, 15 of which are independent variables; 4 of these are continuous (DURATION, AMOUNT, INSTALL_RATE and NUM_CREDITS) and the remaining ones are categorical or dummy variables. The first variable is the output (RESPONSE), which is also a dummy.

We are now ready to move on with the modelling part of our analysis.

Model

Select modeling technique

The modelling techniques that we will be using are the following:

Model Definition
1 Logistic regression
> Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable, although many more complex extensions exist. In regression analysis, logistic regression (or logit regression) is estimating the parameters of a logistic model (a form of binary regression).
(https://en.wikipedia.org/wiki/Logistic_regression)
2 Decision trees
> A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements.
(https://en.wikipedia.org/wiki/Decision_tree)
3 Discriminant analysis
> Discriminant analysis is a statistical technique used to classify observations into non-overlapping groups, based on scores on one or more quantitative predictor variables.
(https://stattrek.com/multiple-regression/discriminant-analysis.aspx)
4 Random forest
> Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean/average prediction (regression) of the individual trees.
(https://en.wikipedia.org/wiki/Random_forest)
5 Neural network
> A neural network is a network or circuit of neurons, or in a modern sense, an artificial neural network, composed of artificial neurons or nodes.
(https://en.wikipedia.org/wiki/Neural_network)
6 XGBoost
> XGBoost is an implementation of gradient boosted decision trees designed for speed and performance.
(https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/)

In order to compare the six models shown above, we will mainly rely on the caret package for each algorithm.

Generate test design

\(H_{0}\): \(Model_n\) gives the best accuracy and sensitivity.

\(H_{1}\): \(Model_n\) does not give the best values.

Where \(n = 1, 2, \ldots, 6\) and each value of \(n\) represents one of the models listed in the selection technique part.

Build model

To be able to generate the models, we first need to standardize the data, as the variables have different scales. Nevertheless, we will normalize only the continuous variables, as the categorical and dummy variables have only a few levels.

Now that the normalization is done, let us move on by creating the training and test sets. This is done by randomly splitting the data into two subsets, with 75% of the observations in the training set and the remaining 25% in the test set.
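The split just described can be sketched in base R. Note that the data frame below is a small mock stand-in, not the normalized credit data; only the names TrainData and TestData match the report.

```r
# 75/25 random split, sketched on a mock data frame.
# `df` is a placeholder for the normalized GermanCredit data.
set.seed(1234)
df <- data.frame(x = rnorm(100), RESPONSE = factor(rbinom(100, 1, 0.7)))

train_idx <- sample(seq_len(nrow(df)), size = floor(0.75 * nrow(df)))
TrainData <- df[train_idx, ]
TestData  <- df[-train_idx, ]

nrow(TrainData)  # 75 observations
nrow(TestData)   # 25 observations
```

Because the split is indexed on row positions, every observation lands in exactly one of the two subsets.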

As you can see above, the classes keep the same proportions in the full dataset, the training set and the test set; in all of them the dependent variable is skewed towards a positive response. For this reason we will evaluate two fits for each algorithm: one on the skewed data and one on the balanced data. Finally, to compare them we will compute the confusion matrix, which includes the following information:

  • \[ Accuracy = \frac{TruePositive+TrueNegative} {TruePositive+TrueNegative+FalsePositive+FalseNegative}\]
  • \[ Sensitivity = \frac{TruePositive} {TruePositive+FalseNegative}\]
  • \[ Specificity = \frac{TrueNegative} {TrueNegative+FalsePositive}\]

In short, the sensitivity measures the true positive rate, which is key for this project: a false positive works against our main objective because it increases the risk of granting credit to applicants who will not make the agreed payments. For that reason, in addition to balancing the data, we will focus the second fit of each model on maximizing sensitivity.
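As a quick sanity check on the formulas above, here is how the three measures follow from a 2x2 confusion matrix; the counts are invented purely for illustration.

```r
# Accuracy, sensitivity and specificity from a 2x2 confusion matrix.
# The four counts below are made up for illustration only.
TP <- 120; TN <- 40; FP <- 25; FN <- 15

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)   # true positive rate
specificity <- TN / (TN + FP)   # true negative rate

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity), 3)
# accuracy 0.800, sensitivity 0.889, specificity 0.615
```

With these counts the model looks accurate overall, yet more than a third of the true negatives are missed, which is exactly the trade-off discussed above.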

M1: Logistic regression

The general equation for the model is:

\[ Z_{i} = ln(\frac{P_{i}} {1-P_{i}}) = \beta_0+\beta_1X_1+...+\beta_nX_n \]

For the application of the algorithm we will apply the following steps:

Data set Steps
Unbalanced data As we saw earlier, the output variable is unbalanced. We evaluate the accuracy and the sensitivity of the model with the following steps:
1) Fit the model.
2) Coefficient analysis.
3) Predict.
4) Confusion matrix.
Balanced data In this step, we balance the data with the trainControl function and then evaluate the accuracy and the sensitivity of the model with the following steps:
5) Fit the model.
6) Predict.
7) Confusion matrix.

Unbalanced data

1) Fitting the model
#Same division
set.seed(1234)

#########################model######################################

train_params <- caret::trainControl(method = "repeatedcv", number = 10, repeats=5) 
#10-Fold Cross Validation   #5 repetitions

mod_lg_fit <- caret::train(RESPONSE ~ ., TrainData, method="glm", 
                           family="binomial",trControl= train_params)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7192  -0.7199   0.3843   0.7077   2.3350  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -0.93510    0.81482  -1.148  0.25113    
## CHK_ACCT1       0.25725    0.24875   1.034  0.30106    
## CHK_ACCT2       1.15025    0.44137   2.606  0.00916 ** 
## CHK_ACCT3       1.65679    0.25994   6.374 1.84e-10 ***
## DURATION       -0.33249    0.12474  -2.665  0.00769 ** 
## HISTORY1       -0.35513    0.60991  -0.582  0.56039    
## HISTORY2        0.51189    0.48162   1.063  0.28784    
## HISTORY3        0.71627    0.52256   1.371  0.17047    
## HISTORY4        1.41024    0.49046   2.875  0.00404 ** 
## PURPOSE1       -0.92834    0.42913  -2.163  0.03052 *  
## PURPOSE2        0.83586    0.54279   1.540  0.12358    
## PURPOSE3       -0.08614    0.44419  -0.194  0.84624    
## PURPOSE4        0.05502    0.43423   0.127  0.89917    
## PURPOSE5       -0.73150    0.57782  -1.266  0.20553    
## PURPOSE6       -0.08247    0.49176  -0.168  0.86682    
## AMOUNT         -0.32341    0.14085  -2.296  0.02167 *  
## SAV_ACCT1       0.45888    0.33152   1.384  0.16630    
## SAV_ACCT2       0.24327    0.45461   0.535  0.59257    
## SAV_ACCT3       0.70072    0.55496   1.263  0.20671    
## SAV_ACCT4       1.31133    0.31892   4.112 3.93e-05 ***
## EMPLOYMENT1     0.25345    0.41874   0.605  0.54499    
## EMPLOYMENT2     0.63919    0.39563   1.616  0.10617    
## EMPLOYMENT3     1.11825    0.43882   2.548  0.01082 *  
## EMPLOYMENT4     0.72410    0.41270   1.755  0.07933 .  
## INSTALL_RATE   -0.35899    0.11182  -3.210  0.00133 ** 
## MALE_SINGLE1    0.54316    0.20976   2.590  0.00961 ** 
## GUARANTOR1      0.89175    0.48361   1.844  0.06519 .  
## PROPERTY1       0.06645    0.23901   0.278  0.78099    
## PROPERTY2      -0.53913    0.41424  -1.301  0.19309    
## OTHER_INSTALL1 -0.55776    0.24600  -2.267  0.02337 *  
## RESIDENCE1     -0.77956    0.50088  -1.556  0.11962    
## RESIDENCE2     -0.27558    0.47140  -0.585  0.55883    
## NUM_CREDITS    -0.11808    0.12232  -0.965  0.33440    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 916.30  on 749  degrees of freedom
## Residual deviance: 682.31  on 717  degrees of freedom
## AIC: 748.31
## 
## Number of Fisher Scoring iterations: 5

In this step, we can see that the variables with the highest importance, which are statistically significant for the model, are: the second and third levels of CHK_ACCT, DURATION, the fourth level of HISTORY, the first level of PURPOSE, AMOUNT, the fourth level of SAV_ACCT, the third level of EMPLOYMENT, INSTALL_RATE, MALE_SINGLE and the first level of OTHER_INSTALL.

2) Coefficients
Table 1: Significance of variables
Variable Significant
CHK_ACCT1 FALSE
CHK_ACCT2 TRUE
CHK_ACCT3 TRUE
DURATION TRUE
HISTORY1 FALSE
HISTORY2 FALSE
HISTORY3 FALSE
HISTORY4 TRUE
PURPOSE1 TRUE
PURPOSE2 FALSE
PURPOSE3 FALSE
PURPOSE4 FALSE
PURPOSE5 FALSE
PURPOSE6 FALSE
AMOUNT TRUE
SAV_ACCT1 FALSE
SAV_ACCT2 FALSE
SAV_ACCT3 FALSE
SAV_ACCT4 TRUE
EMPLOYMENT1 FALSE
EMPLOYMENT2 FALSE
EMPLOYMENT3 TRUE
EMPLOYMENT4 FALSE
INSTALL_RATE TRUE
MALE_SINGLE1 TRUE
GUARANTOR1 FALSE
PROPERTY1 FALSE
PROPERTY2 FALSE
OTHER_INSTALL1 TRUE
RESIDENCE1 FALSE
RESIDENCE2 FALSE
NUM_CREDITS FALSE

If we look at the coefficients of the different variables we can conclude that, among the significant ones described before, CHK_ACCT, HISTORY (all but the first level), SAV_ACCT, EMPLOYMENT and MALE_SINGLE have a positive impact on the output, meaning that the higher their level, or if the dummy equals 1, the higher the probability of having RESPONSE = 1.

On the other hand, among the significant variables, DURATION, PURPOSE (all but level two and four), AMOUNT and OTHER_INSTALL have a negative effect on the output, meaning that if they increase their level or value, or if they have a positive value (for the dummies), the probability of having a positive response will decrease.

The linear predictor is given by \[ \eta = - 0.9 + 0.3 * CHKACCT_1 + 1.2 * CHKACCT_2 + 1.7 * CHKACCT_3 - 0.3 * DURATION - 0.4 * HISTORY_1 + 0.5 * HISTORY_2 + 0.7 * HISTORY_3 + 1.4 * HISTORY_4 - 0.9 * PURPOSE_1 + 0.8 * PURPOSE_2 - 0.08 * PURPOSE_3 + 0.05 * PURPOSE_4 - 0.7 * PURPOSE_5 - 0.08 * PURPOSE_6 - 0.3 * AMOUNT + 0.5 * SAVACCT_1 + 0.2 * SAVACCT_2 + 0.7 * SAVACCT_3 + 1.3 * SAVACCT_4 + 0.25 * EMPLOYMENT_1 + 0.6 * EMPLOYMENT_2 + 1.1 * EMPLOYMENT_3 + 0.7 * EMPLOYMENT_4 - 0.4 * INSTALLRATE + 0.5 * MALESINGLE_1 + 0.9 * GUARANTOR_1 + 0.06 * PROPERTY_1 - 0.5 * PROPERTY_2 - 0.6 * OTHERINSTALL_1 - 0.8 * RESIDENCE_1 - 0.3 * RESIDENCE_2 - 0.1 * NUMCREDITS \]

To be clear, if for example the PURPOSE variable takes the value 3, only the coefficient of PURPOSE_3 is added to the others; the same holds for every other categorical variable. For the dummies, the coefficient is added only if the variable equals 1, otherwise not. For the continuous variables, the coefficient is multiplied by the value recorded in the observation.
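A toy illustration of that rule for a categorical predictor (the coefficients below are invented, not those of the fitted model): only the dummy of the observed level contributes to the linear predictor.

```r
# Only the dummy matching the observed PURPOSE level enters eta.
# These coefficients are made up for illustration.
purpose_coefs <- c(PURPOSE_1 = -0.9, PURPOSE_2 = 0.8,
                   PURPOSE_3 = -0.08, PURPOSE_4 = 0.05)
observed_level <- "PURPOSE_3"
contribution <- purpose_coefs[[observed_level]]
contribution   # -0.08: the only PURPOSE term added to the linear predictor
```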

3) Prediction (unbalanced)

Now we will get the predictions using this model. We start from the probabilities of the output given the coefficients found by fitting the model, then use a cut point of 0.5 to decide whether the predicted value is 1 (if the probability is higher than 0.5) or 0 (otherwise). The model plugs the information of the new observation into the equation given above, obtains the value \(\eta\), and turns it into a probability with \(p = 1 / (1 + e^{-\eta})\). This probability, compared with the 0.5 cut point, determines whether the prediction is 1 or 0.

#prediction given the model
lg.pred <- predict(mod_lg_fit, newdata = TestData)  
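The probability-and-cut-point mechanics can be reproduced by hand; the eta values below are invented, not taken from the fitted model.

```r
# From linear predictor eta to class via p = 1 / (1 + exp(-eta))
# and a 0.5 cut point. The eta values are made up for illustration.
eta  <- c(-1.2, 0.3, 2.0)
p    <- 1 / (1 + exp(-eta))     # predicted probabilities
pred <- ifelse(p > 0.5, 1, 0)   # apply the 0.5 cut point
pred   # 0 1 1
```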

4) Diagnosis (Unbalanced)

Balanced data

5) Fitting the model: balanced
#Same division
set.seed(1234)

#########################model######################################
train_params <- caret::trainControl(method = "repeatedcv", number = 10, 
                                    repeats=5, sampling = "down",
                                    summaryFunction = twoClassSummary)

mod_lg_fitbalance <- caret::train(RESPONSE ~ ., TrainData, method="glm", 
                                  family="binomial", 
                                  metric = "Sens", #optimize sensitivity
                                  maximize = TRUE, #maximize the metric
                                  trControl= train_params)

################check outputs################################vv
summary(mod_lg_fitbalance)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.22739  -0.78261  -0.02752   0.79604   2.53360  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -0.8770     1.0650  -0.823 0.410229    
## CHK_ACCT1        0.3269     0.3181   1.028 0.304055    
## CHK_ACCT2        1.3096     0.5156   2.540 0.011079 *  
## CHK_ACCT3        1.9754     0.3241   6.096 1.09e-09 ***
## DURATION        -0.2847     0.1556  -1.830 0.067224 .  
## HISTORY1        -0.5277     0.7045  -0.749 0.453894    
## HISTORY2        -0.3269     0.5422  -0.603 0.546492    
## HISTORY3        -0.1523     0.5993  -0.254 0.799437    
## HISTORY4         0.8379     0.5514   1.519 0.128658    
## PURPOSE1        -1.4030     0.5170  -2.714 0.006656 ** 
## PURPOSE2         0.4736     0.6667   0.710 0.477406    
## PURPOSE3        -0.6911     0.5304  -1.303 0.192610    
## PURPOSE4        -0.6082     0.5162  -1.178 0.238638    
## PURPOSE5        -0.7442     0.6780  -1.098 0.272320    
## PURPOSE6        -0.7298     0.5839  -1.250 0.211399    
## AMOUNT          -0.2435     0.1755  -1.387 0.165326    
## SAV_ACCT1        0.5952     0.4076   1.460 0.144271    
## SAV_ACCT2       -0.2930     0.5523  -0.530 0.595804    
## SAV_ACCT3        0.4723     0.6553   0.721 0.471134    
## SAV_ACCT4        1.2719     0.3723   3.417 0.000634 ***
## EMPLOYMENT1      0.8801     0.5906   1.490 0.136204    
## EMPLOYMENT2      1.2155     0.5581   2.178 0.029427 *  
## EMPLOYMENT3      1.7196     0.6056   2.839 0.004519 ** 
## EMPLOYMENT4      1.0968     0.5823   1.883 0.059652 .  
## INSTALL_RATE    -0.3919     0.1387  -2.825 0.004730 ** 
## MALE_SINGLE1     0.7278     0.2648   2.749 0.005983 ** 
## GUARANTOR1       0.4654     0.6556   0.710 0.477814    
## PROPERTY1        0.2035     0.2913   0.699 0.484762    
## PROPERTY2       -1.0469     0.5950  -1.760 0.078487 .  
## OTHER_INSTALL1  -0.7800     0.3200  -2.438 0.014775 *  
## RESIDENCE1      -0.9846     0.6864  -1.435 0.151419    
## RESIDENCE2      -0.7140     0.6671  -1.070 0.284474    
## NUM_CREDITS     -0.1793     0.1529  -1.173 0.240747    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 623.83  on 449  degrees of freedom
## Residual deviance: 449.68  on 417  degrees of freedom
## AIC: 515.68
## 
## Number of Fisher Scoring iterations: 5
6) Prediction
#probability given the model
lg.pred.b <- predict(mod_lg_fitbalance, newdata = TestData)  

Contrary to the output of the first model, we can see that the proportions of the predictions are better balanced.
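The balancing above is handled by caret's sampling = "down" option; the idea can be sketched in base R on a mock response vector: keep all of the minority class and an equally sized random subset of the majority class.

```r
# Down-sampling sketch: equalize class counts by dropping majority
# rows. `y` is a mock response vector, not the real RESPONSE column.
set.seed(1)
y <- factor(c(rep("1", 70), rep("0", 30)))
n_min <- min(table(y))
keep  <- unlist(lapply(levels(y), function(l) sample(which(y == l), n_min)))
table(y[keep])   # 30 observations of each class
```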

7) Diagnosis (balanced)

M2: Decision trees

The next image better illustrates how decision trees work.

Figure 4: A caption


The process consists in the minimization of the classification error rate

\[E = 1 - \max_{k}(\hat{p}_{mk})\] , where \(\hat{p}_{mk}\) is the proportion of training observations in the m-th region that belong to the k-th class.
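For instance, with invented node proportions, the error rate of a single terminal node works out as follows.

```r
# Classification error rate E = 1 - max_k(p_mk) for one node.
# The class proportions are invented for illustration.
p_mk <- c("0" = 0.3, "1" = 0.7)
E <- 1 - max(p_mk)
E   # 0.3: the node misclassifies its 30% minority share
```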

For the application of the algorithm we will apply the following steps:

Data set Steps
Unbalanced data As we saw earlier, the output variable is unbalanced. We evaluate the accuracy and the sensitivity of the model with the following steps:
1) Fit the model.
2) Plot the best tree.
3) Predict.
4) Confusion matrix.
Balanced data In this step, we balance the data with the trainControl function and then evaluate the accuracy and the sensitivity of the model with the following steps:
5) Fit the model.
6) Plot the best tree.
7) Predict.
8) Confusion matrix.

Unbalanced data

1) Fitting the model

We will start by fitting the model on the data.

#Same division
set.seed(1234)

#########################model######################################
train_params <- trainControl(method = "repeatedcv", number = 10, repeats=5) 

#10-Fold Cross Validation #5 repetitions

mod_dt_fit <- caret::train(RESPONSE ~ ., TrainData, method="rpart", 
                           trControl= train_params)
##      var                  n               wt             dev       
##  Length:15          Min.   : 14.0   Min.   : 14.0   Min.   :  3.0  
##  Class :character   1st Qu.: 34.0   1st Qu.: 34.0   1st Qu.:  9.0  
##  Mode  :character   Median :126.0   Median :126.0   Median : 42.0  
##                     Mean   :185.1   Mean   :185.1   Mean   : 62.8  
##                     3rd Qu.:252.0   3rd Qu.:252.0   3rd Qu.: 84.0  
##                     Max.   :750.0   Max.   :750.0   Max.   :225.0  
##       yval       complexity          ncompete       nsurrogate 
##  Min.   :1.0   Min.   :0.000000   Min.   :0.000   Min.   :0.0  
##  1st Qu.:1.0   1st Qu.:0.002222   1st Qu.:0.000   1st Qu.:0.0  
##  Median :2.0   Median :0.019259   Median :0.000   Median :0.0  
##  Mean   :1.6   Mean   :0.015605   Mean   :1.867   Mean   :0.8  
##  3rd Qu.:2.0   3rd Qu.:0.026667   3rd Qu.:4.000   3rd Qu.:0.5  
##  Max.   :2.0   Max.   :0.026667   Max.   :4.000   Max.   :5.0  
##       yval2.V1             yval2.V2             yval2.V3             yval2.V4             yval2.V5          yval2.nodeprob   
##  Min.   :1.0          Min.   :  3.00000    Min.   :  4.0        Min.   :0.1174497    Min.   :0.1428571    Min.   :0.0186667  
##  1st Qu.:1.0          1st Qu.: 26.50000    1st Qu.: 15.5        1st Qu.:0.3346560    1st Qu.:0.3971861    1st Qu.:0.0453333  
##  Median :2.0          Median : 42.00000    Median : 60.0        Median :0.4285714    Median :0.5714286    Median :0.1680000  
##  Mean   :1.6          Mean   : 67.13333    Mean   :118.0        Mean   :0.4620491    Mean   :0.5379509    Mean   :0.2468444  
##  3rd Qu.:2.0          3rd Qu.: 84.50000    3rd Qu.:167.5        3rd Qu.:0.6028139    3rd Qu.:0.6653440    3rd Qu.:0.3360000  
##  Max.   :2.0          Max.   :225.00000    Max.   :525.0        Max.   :0.8571429    Max.   :0.8825503    Max.   :1.0000000
2) Plot (Unbalanced)

3) Prediction (unbalanced)
#prediction given the model
dt.pred <- predict(mod_dt_fit, newdata = TestData)  #predict returns the predicted class for the binomial response

The prediction is clearly biased towards a positive answer.

4) Diagnosis (Unbalanced)

We can see that here the sensitivity is really low, while the specificity is higher, reaching a value above 92%; the sensitivity, however, is the measure we are the most interested in. The accuracy is around 74%. Here, in 30 cases in which the model should have predicted a negative value, it actually predicted a positive one, which could cost the company quite a lot.

Balanced data

5) Fitting the model: balanced
##      var                  n               wt             dev       
##  Length:5           Min.   : 37.0   Min.   : 37.0   Min.   : 14.0  
##  Class :character   1st Qu.:152.0   1st Qu.:152.0   1st Qu.: 35.0  
##  Mode  :character   Median :261.0   Median :261.0   Median : 85.0  
##                     Mean   :239.6   Mean   :239.6   Mean   : 93.4  
##                     3rd Qu.:298.0   3rd Qu.:298.0   3rd Qu.:108.0  
##                     Max.   :450.0   Max.   :450.0   Max.   :225.0  
##       yval       complexity         ncompete     nsurrogate 
##  Min.   :1.0   Min.   :0.01111   Min.   :0.0   Min.   :0.0  
##  1st Qu.:1.0   1st Qu.:0.01111   1st Qu.:0.0   1st Qu.:0.0  
##  Median :1.0   Median :0.02222   Median :0.0   Median :0.0  
##  Mean   :1.4   Mean   :0.08978   Mean   :1.6   Mean   :0.6  
##  3rd Qu.:2.0   3rd Qu.:0.04000   3rd Qu.:4.0   3rd Qu.:0.0  
##  Max.   :2.0   Max.   :0.36444   Max.   :4.0   Max.   :3.0  
##       yval2.V1             yval2.V2             yval2.V3             yval2.V4             yval2.V5          yval2.nodeprob   
##  Min.   :1.0          Min.   : 14          Min.   : 23.0        Min.   :0.2302632    Min.   :0.3256705    Min.   :0.0822222  
##  1st Qu.:1.0          1st Qu.: 35          1st Qu.: 85.0        1st Qu.:0.3783784    1st Qu.:0.3624161    1st Qu.:0.3377778  
##  Median :1.0          Median :176          Median :108.0        Median :0.5000000    Median :0.5000000    Median :0.5800000  
##  Mean   :1.4          Mean   :128          Mean   :111.6        Mean   :0.4841110    Mean   :0.5158890    Mean   :0.5324444  
##  3rd Qu.:2.0          3rd Qu.:190          3rd Qu.:117.0        3rd Qu.:0.6375839    3rd Qu.:0.6216216    3rd Qu.:0.6622222  
##  Max.   :2.0          Max.   :225          Max.   :225.0        Max.   :0.6743295    Max.   :0.7697368    Max.   :1.0000000
6) Plot (balanced)

7) Prediction

8) Diagnosis (balanced)

We can see that here the sensitivity has improved; however, the specificity is lower, now around 55%. The accuracy is around 60%.

M3: Discriminant analysis

There are four types of discriminant analysis; we explain them in the following table:

Model Definition
1 LDA > Linear discriminant analysis (LDA), normal discriminant analysis (NDA), or discriminant function analysis is a generalization of Fisher’s linear discriminant, a method used in statistics and other fields, to find a linear combination of features that characterizes or separates two or more classes of objects or events.
(https://en.wikipedia.org/wiki/Linear_discriminant_analysis)
2 QDA > This method assumes that the measurements from each class are normally distributed, but there is no assumption that the covariance of each of the classes is identical. When the normality assumption is true, the best possible test for the hypothesis that a given measurement is from a given class is the likelihood ratio test.
(https://en.wikipedia.org/wiki/Quadratic_classifier)
3 FDA > It analyzes data providing information about curves, surfaces or anything else varying over a continuum. In its most general form, under an FDA framework each sample element is considered to be a function.
(https://en.wikipedia.org/wiki/Functional_data_analysis)
4 MDA > It is a multivariate dimensionality reduction technique. It has been used to predict signals as diverse as neural memory traces and corporate failure.
(https://en.wikipedia.org/wiki/Multiple_discriminant_analysis)

The next image better illustrates how each model works.

Linear Discriminant Analysis

Steps for the application of the algorithm:

Data set Steps
Unbalanced data 1) Fit the model.
2) Predict.
3) Confusion matrix.
Balanced data 4) Fit the model.
5) Predict.
6) Confusion matrix.

Unbalanced data

1) Fitting the model
#Same division
set.seed(1234)

#########################model######################################
train_params <- caret::trainControl(method = "repeatedcv", number = 10, repeats=5) 
#K-Fold Cross Validation
mod_lda_fit <- caret::train(RESPONSE ~ ., TrainData, method="lda", 
                           family="binomial",trControl= train_params)
##             Length Class      Mode     
## prior        2     -none-     numeric  
## counts       2     -none-     numeric  
## means       64     -none-     numeric  
## scaling     32     -none-     numeric  
## lev          2     -none-     character
## svd          1     -none-     numeric  
## N            1     -none-     numeric  
## call         4     -none-     call     
## xNames      32     -none-     character
## problemType  1     -none-     character
## tuneValue    1     data.frame list     
## obsLevels    2     -none-     character
## param        1     -none-     list

We can see from the graph that the model seems to be better at predicting the positive value, as its mean is around 1, while for the negative output the mean, at around -1, is a bit lower than 0, the value it should take.

The linear combination of predictor variables that are used to form the decision rule is the following:

\[ RESPONSE = -0.3265 * DURATION -0.2747 * HISTORY_1 + 0.8792 * HISTORY_2 + 1.1810 * HISTORY_3 + 1.6214 * HISTORY_4 - 0.7437 * PURPOSE_1 + 0.7736 * PURPOSE_2 + 0.0172 * PURPOSE_3 + 0.2035 * PURPOSE_4 - 0.6116 * PURPOSE_5 + 0.1298 * PURPOSE_6 - 0.2579 * AMOUNT + 0.5066 * SAV_ACCT_1 + 0.7517 * SAV_ACCT_2 + 0.7778 * SAV_ACCT_3 + 1.0997 * SAV_ACCT_4 + 0.6175 * EMPLOYMENT_1 + 1.1982 * EMPLOYMENT_2 + 1.4580 * EMPLOYMENT_3 + 1.1806 * EMPLOYMENT_4 - 0.2897 * INSTALL_RATE - 0.5663 * SEX_MALE_1 + 0.9272 * MALE_SINGLE_1 + 0.5183 * MALE_MAR_WID_1 - 0.0863 * CO_APPLICANT_1 + 0.5084 * GUARANTOR_1 - 0.2437 * PRESENT_RESIDENT_-1 - 0.1913 * PRESENT_RESIDENT_0 - 0.0477 * PRESENT_RESIDENT_1 + 0.0367 * PROPERTY_1 - 0.6374 * PROPERTY_2 + 0.0502 * AGE - 0.4501 * OTHER_INSTALL_1 - 0.7509 * RESIDENCE_1 - 0.2185 * RESIDENCE_2 - 0.0703 * NUM_CREDITS - 1.0570 * JOB_1 - 1.0581 * JOB_2 - 0.8362 * JOB_3\]

Each new observation is evaluated with this formula by plugging its information into it. It follows the same principle described for the generalized linear model.
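That evaluation can be sketched with a toy example; the two coefficients and the standardized observation below are invented, not the fitted values above.

```r
# Scoring one new observation with a linear discriminant rule:
# multiply each predictor by its coefficient and sum. All numbers
# here are made up for illustration.
coefs  <- c(DURATION = -0.33, AMOUNT = -0.26)
newobs <- c(DURATION = 0.5, AMOUNT = -1.0)   # standardized values
score  <- sum(coefs * newobs)
score   # 0.095
```

The class is then assigned by comparing this score with the decision boundary the fitted model provides.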

2) Prediction (unbalanced)
lda.pred <- predict(mod_lda_fit, newdata = TestData)  #predict returns the predicted class for the binomial response

This graph confirms what was explained in the fitting part: the predictions tend to give a positive response.

3) Diagnosis (Unbalanced)

Here the sensitivity is higher than in the previous unbalanced models (above 40%), but it is still quite low. The precision is also modest, at around 78%. What is important to note is that 22 times the model predicted a positive value for the output when it should have been negative, which is something that could cost the company quite a lot.

Balanced data

4) Fitting the model: balanced
#########################model######################################
train_params <- caret::trainControl(method = "repeatedcv", number = 10, 
                                    repeats=5, sampling = "down", 
                                    summaryFunction = twoClassSummary)


mod_lda_fitbalance <- caret::train(RESPONSE ~ ., TrainData, method="lda", 
                           family="binomial",
                           metric = "Sens", #optimize sensitivity
                           maximize = TRUE, #maximize the metric
                           trControl= train_params)
##             Length Class      Mode     
## prior        2     -none-     numeric  
## counts       2     -none-     numeric  
## means       64     -none-     numeric  
## scaling     32     -none-     numeric  
## lev          2     -none-     character
## svd          1     -none-     numeric  
## N            1     -none-     numeric  
## call         4     -none-     call     
## xNames      32     -none-     character
## problemType  1     -none-     character
## tuneValue    1     data.frame list     
## obsLevels    2     -none-     character
## param        1     -none-     list
5) Prediction
lda.pred.b <- predict(mod_lda_fitbalance, newdata = TestData)  #predict returns the predicted class for the binomial response

6) Diagnosis (balanced)

Quadratic discriminant analysis

Steps for the application of the algorithm:

Data set Steps
Unbalanced data 1) Fit the model.
2) Predict.
3) Confusion matrix.
Balanced data 4) Fit the model.
5) Predict.
6) Confusion matrix.

Unbalanced data

1) Fitting the model
#Same division
set.seed(1234)

#########################model######################################vvvv
train_params <- trainControl(method = "repeatedcv", number = 10, repeats=5) #K-Fold Cross Validation
mod_qda_fit <- caret::train(RESPONSE ~ ., TrainData, method="qda", 
                           family="binomial",trControl= train_params)
##             Length Class      Mode     
## prior          2   -none-     numeric  
## counts         2   -none-     numeric  
## means         64   -none-     numeric  
## scaling     2048   -none-     numeric  
## ldet           2   -none-     numeric  
## lev            2   -none-     character
## N              1   -none-     numeric  
## call           4   -none-     call     
## xNames        32   -none-     character
## problemType    1   -none-     character
## tuneValue      1   data.frame list     
## obsLevels      2   -none-     character
## param          1   -none-     list
2) Prediction (unbalanced)

Here it still seems that the majority of the false predictions are on the positive level; however, they appear to be fewer than before.

3) Diagnosis (Unbalanced)

As we predicted, the model performs a little worse than the LDA except for the sensitivity, which is the highest so far (over 50%) though still quite low. The specificity is moderately high (above 85%), as is the accuracy (above 70%). Since we want the number of false positives to be low, the 35 observed here is still quite high.

Balanced data

4) Fitting the model: balanced
5) Prediction
qda.pred.b <- predict(mod_qda_fitbalance, newdata = TestData)  

6) Diagnosis (balanced)

Functional data analysis (FDA)

Steps for the application of the algorithm:

Data set Steps
Unbalanced data 1) Fit the model.
2) Predict.
3) Confusion matrix.
Balanced data 4) Fit the model.
5) Predict.
6) Confusion matrix.

Unbalanced data

1) Fitting the model
#Same division
set.seed(1234)

#########################model######################################vvvv
train_params <- trainControl(method = "repeatedcv", number = 10, repeats=5) #K-Fold Cross Validation

library(earth)
mod_fda_fit <- caret::train(RESPONSE ~ ., TrainData, method="fda", 
                              trControl= train_params)
## Flexible Discriminant Analysis 
## 
## 750 samples
##  14 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 675, 676, 675, 674, 674, 675, ... 
## Resampling results across tuning parameters:
## 
##   nprune  Accuracy   Kappa    
##    2      0.7000199  0.0000000
##   13      0.7349707  0.3090363
##   25      0.7477476  0.3574288
## 
## Tuning parameter 'degree' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 1 and nprune = 25.

Here we see that this model has only one dimension which, obviously, explains 100% of the between-group variance. We expect that it will perform poorly in the predictions.

2) Prediction (unbalanced)
fda.pred <- predict(mod_fda_fit, newdata = TestData)  

3) Diagnosis (Unbalanced)

Contrary to what was expected, this model has a sensitivity of almost 55%, among the highest so far, and a specificity higher than 85%. The accuracy is around 76%. There are 36 false positive observations.

Balanced data

4) Fitting the model: balanced
#########################model######################################
train_params <- caret::trainControl(method = "repeatedcv", number = 10, 
                                    repeats=5, sampling = "down", 
                                    summaryFunction = twoClassSummary)

mod_fda_fitbalance <- caret::train(RESPONSE ~ ., TrainData, method="fda", 
                                   metric = "Sens", #optimize sensitivity
                                    maximize = TRUE,
                                    trControl= train_params)
##                   Length Class      Mode     
## percent.explained  1     -none-     numeric  
## values             1     -none-     numeric  
## means              2     -none-     numeric  
## theta.mod          1     -none-     numeric  
## dimension          1     -none-     numeric  
## prior              2     table      numeric  
## fit               29     earth      list     
## call               7     -none-     call     
## terms              3     terms      call     
## confusion          4     table      numeric  
## xNames            32     -none-     character
## problemType        1     -none-     character
## tuneValue          2     data.frame list     
## obsLevels          2     -none-     character
## param              0     -none-     list
5) Prediction

Using the model we get the predictions for the RESPONSE variable and we can construct the confusion matrix for this case.

fda.pred.b <- predict(mod_fda_fitbalance, newdata = TestData)  #predict returns the predicted class for the binomial response

6) Diagnosis (balanced)

Mixture discriminant analysis (MDA)

Steps for the application of the algorithm:

Data set Steps
Unbalanced data 1) Fit the model.
2) Predict.
3) Confusion matrix.
Balanced data 4) Fit the model.
5) Predict.
6) Confusion matrix.

Unbalanced data

1) Fitting the model
#Same division
set.seed(1234)

#########################model######################################vvvv
train_params <- trainControl(method = "repeatedcv", number = 10, repeats=5) #K-Fold Cross Validation
mod_mda_fit <- caret::train(RESPONSE ~ ., TrainData, method="mda", 
                           family="binomial",trControl= train_params)
##                   Length Class      Mode     
## percent.explained  3     -none-     numeric  
## values             3     -none-     numeric  
## means             12     -none-     numeric  
## theta.mod          9     -none-     numeric  
## dimension          1     -none-     numeric  
## sub.prior          2     -none-     list     
## fit                5     polyreg    list     
## call               5     -none-     call     
## weights            2     -none-     list     
## prior              2     table      numeric  
## assign.theta       2     -none-     list     
## deviance           1     -none-     numeric  
## confusion          4     table      numeric  
## terms              3     terms      call     
## xNames            32     -none-     character
## problemType        1     -none-     character
## tuneValue          1     data.frame list     
## obsLevels          2     -none-     character
## param              1     -none-     list

The summary gives the percentage of the between-group variance explained by each of the dimensions that have been created (in this case 3). We can see that the vast majority of the variance is explained by the first three dimensions (reaching more than 90%), but we could also be satisfied considering only the first two, as together they explain almost 80% of the variance.
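That cumulative reading of the percent.explained output can be reproduced with cumsum; the per-dimension percentages below are illustrative, chosen only to match the description.

```r
# Cumulative between-group variance explained across dimensions.
# The per-dimension percentages are invented for illustration.
pct <- c(55, 24, 12)
cumsum(pct)   # 55 79 91: ~80% with two dimensions, >90% with three
```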

2) Prediction (unbalance)
mda.pred <- predict(mod_mda_fit, newdata = TestData)  

3) Diagnosis (Unbalance)

The sensitivity in this case is around 55%, while the specificity is higher, reaching almost 85%. The accuracy is around 75%.
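
These percentages follow directly from the test-set confusion matrix. As an illustration (the cell counts below are reconstructed to be consistent with the reported rates, which imply a test set of 249 observations with 75 bad and 174 good applicants; class "0", bad credit, is treated as the positive class), the three metrics are computed as:

```r
# Hypothetical confusion-matrix cells consistent with the reported rates
tp <- 41   # bad applicants correctly flagged
fn <- 34   # bad applicants missed
tn <- 148  # good applicants correctly accepted
fp <- 26   # good applicants wrongly flagged

sensitivity <- tp / (tp + fn)                   # ~0.547
specificity <- tn / (tn + fp)                   # ~0.851
accuracy    <- (tp + tn) / (tp + fn + tn + fp)  # ~0.759

round(c(sens = sensitivity, spec = specificity, acc = accuracy), 3)
```

In practice these values come straight out of caret::confusionMatrix() applied to mda.pred and the observed responses; the sketch only makes the arithmetic behind them explicit.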

Balanced data

4) Fitting the model
#########################model######################################
train_params <- caret::trainControl(method = "repeatedcv", number = 10, 
                                    repeats=5, sampling = "down", 
                                    summaryFunction = twoClassSummary)

mod_mda_fitbalance <- caret::train(RESPONSE ~ ., TrainData, method="mda", 
                           family="binomial",
                           metric = "Sens", #optimize sensitivity
                           maximize = TRUE,
                           trControl= train_params)
## Mixture Discriminant Analysis 
## 
## 750 samples
##  14 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 675, 676, 675, 674, 676, 674, ... 
## Addtional sampling using down-sampling
## 
## Resampling results across tuning parameters:
## 
##   subclasses  ROC  Sens       Spec     
##   2           NaN  0.6917787  0.7012772
##   3           NaN  0.6868775  0.7046952
##   4           NaN  0.6472727  0.6982366
## 
## Sens was used to select the optimal model using the largest value.
## The final value used for the model was subclasses = 2.
5) Prediction
mda.pred.b <- predict(mod_mda_fitbalance, newdata = TestData)  # class predictions on the test set

6) Diagnosis

M4: Random Forest

Steps for the application of the algorithm:

Data set        Steps
Unbalanced data 1) Fit the model.
                2) Checking variables.
                3) Predict.
                4) Confusion matrix.
Balanced data   5) Fit the model.
                6) Predict.
                7) Confusion matrix.

Unbalanced data

1) Fitting the model
#Same division
set.seed(1234)

#########################model######################################vvvv
train_params <- trainControl(method = "repeatedcv", number = 10, repeats=5) #K-Fold Cross Validation
mod_rf_fit <- caret::train(RESPONSE ~ ., TrainData, method="rf", 
                           trControl= train_params)
## Random Forest 
## 
## 750 samples
##  14 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 675, 676, 675, 674, 674, 675, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7235177  0.1525749
##   17    0.7477162  0.3489012
##   32    0.7445090  0.3450546
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 17.
2) Checking Variables
##                        0          1 MeanDecreaseAccuracy MeanDecreaseGini
## CHK_ACCT      30.8377522 14.9310529          29.62471840        24.909137
## DURATION       3.0789831 17.2512610          16.97809183        23.722782
## HISTORY       10.6308613 10.2319846          14.95907324        16.225153
## PURPOSE        4.6632195  3.8556094           5.99145514        21.218288
## AMOUNT         2.7620518 12.2473068          13.09750874        37.191057
## SAV_ACCT       9.7424546  2.4221673           7.48429134        12.897821
## EMPLOYMENT     4.7202650  3.2030500           5.62544435        15.860447
## INSTALL_RATE  -1.5100678  2.6360360           1.30398030        10.326397
## MALE_SINGLE    3.2165679  0.1469817           2.12884373         5.216763
## GUARANTOR      2.6848191  9.3310231           9.12005837         2.063821
## PROPERTY       0.8703037  4.2813758           4.00866662         7.959729
## OTHER_INSTALL  1.6732583  3.4119582           3.71946567         4.919703
## RESIDENCE     -0.4792045  0.2264302          -0.09300405         6.525084
## NUM_CREDITS   -1.0737951  4.8444649           3.40369292         5.752978

The summary of the model gives the decrease in accuracy and the decrease in the Gini index for each variable, along with the number of trees that are built (500 in our case) and the number of variables that are randomly chosen to be tried at each split before picking the best one to describe the node. Moreover, we can already find the confusion matrix (we will show it again afterwards to keep the analysis coherent throughout all the models), with the class error and the out-of-bag (OOB) estimate of the error rate.

Let’s give some definitions to be clearer:

Variable importance is the mean decrease of accuracy over all out-of-bag cross validated predictions, when a given variable is permuted after training, but before prediction.

GINI importance measures the average gain of purity by splits of a given variable. If the variable is useful, it tends to split mixed-label nodes into pure single-class nodes. Splitting on a permuted variable tends neither to increase nor decrease node purities.

Source: https://stats.stackexchange.com/questions/197827/how-to-interpret-mean-decrease-in-accuracy-and-mean-decrease-gini-in-random-fore

Out-of-bag (OOB) error, also called out-of-bag estimate, is a method of measuring the prediction error of random forests, boosted decision trees, and other machine learning models utilizing bootstrap aggregating (bagging) to sub-sample data samples used for training. OOB is the mean prediction error on each training sample xᵢ, using only the trees that did not have xᵢ in their bootstrap sample. source: https://en.wikipedia.org/wiki/Out-of-bag_error
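
A quick base-R sketch of the idea behind OOB: each bootstrap sample (one per tree) leaves out roughly a third of the observations, about 1 - 1/e ≈ 36.8% on average, and those left-out rows are the ones used to score that tree. The sample size 750 below matches our training set; everything else is illustrative:

```r
set.seed(1234)
n <- 750                                      # size of our training set
boot <- sample(n, size = n, replace = TRUE)   # one bootstrap sample (one tree)
oob  <- setdiff(seq_len(n), boot)             # rows never drawn: the out-of-bag set
length(oob) / n                               # close to 1 - 1/e, i.e. about 0.37
```

Averaging each row's prediction error over the trees for which it was out-of-bag gives the OOB error estimate reported by the random forest.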

3) Prediction (unbalance)
rf.pred <- predict(mod_rf_fit, newdata = TestData)

4) Diagnosis (Unbalance)

Here the sensitivity is around 47%, the specificity is high (more than 87%) and the accuracy is around 75%, while the number of false positives is the highest among the models, with 36 observations.

Balanced data

5) Fitting the model: balance
train_params <- caret::trainControl(method = "repeatedcv", number = 10, 
                                    repeats=5, sampling = "down", 
                                    summaryFunction = twoClassSummary)

mod_rf_fitbalance <- caret::train(RESPONSE ~ ., TrainData, method="rf", 
                           family="binomial",
                           metric = "Sens", #optimize sensitivity
                           maximize = TRUE,
                           trControl= train_params)
## Random Forest 
## 
## 750 samples
##  14 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 674, 676, 675, 675, 675, 675, ... 
## Addtional sampling using down-sampling
## 
## Resampling results across tuning parameters:
## 
##   mtry  ROC  Sens       Spec     
##    2    NaN  0.7576285  0.6438099
##   17    NaN  0.7306324  0.6694557
##   32    NaN  0.7314229  0.6641074
## 
## Sens was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
6) Prediction
rf.pred.b <- predict(mod_rf_fitbalance, newdata = TestData)  

7) Diagnosis

M5: Neural Networks

Steps for the application of the algorithm:

Data set        Steps
Unbalanced data 1) Fit the model.
                2) Plot.
                3) Predict.
                4) Confusion matrix.
Balanced data   5) Fit the model.
                6) Predict.
                7) Confusion matrix.

Unbalanced data

1) Fitting the model
#Same division
set.seed(1234)

#########################model######################################
train_params <- trainControl(method = "repeatedcv", number = 10, repeats=5) 
mod_nn_fit <- caret::train(RESPONSE ~ ., TrainData, method="nnet", 
                           trControl= train_params)
## Neural Network 
## 
## 750 samples
##  14 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 675, 676, 675, 674, 674, 675, ... 
## Resampling results across tuning parameters:
## 
##   size  decay  Accuracy   Kappa    
##   1     0e+00  0.7050009  0.3312729
##   1     1e-04  0.7093730  0.3370020
##   1     1e-01  0.7509943  0.3830480
##   3     0e+00  0.7003229  0.2899405
##   3     1e-04  0.7104328  0.2916348
##   3     1e-01  0.7254274  0.3296829
##   5     0e+00  0.6965142  0.2769069
##   5     1e-04  0.7045505  0.2925045
##   5     1e-01  0.7069400  0.2912240
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 1 and decay = 0.1.
2) Plot

3) Prediction (unbalance)
nn.pred <- predict(mod_nn_fit, newdata = TestData)  

4) Diagnosis (Unbalance)

Balanced data

5) Fitting the model: balance
train_params <- caret::trainControl(method = "repeatedcv", number = 10, 
                                    repeats=5, sampling = "down", 
                                    summaryFunction = twoClassSummary)

mod_nn_fitbalance <- caret::train(RESPONSE ~ ., TrainData, method="nnet", 
                           family="binomial",
                           metric = "Sens", #optimize sensitivity
                           maximize = TRUE,
                           trControl= train_params)
## Neural Network 
## 
## 750 samples
##  14 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 675, 675, 675, 675, 676, 675, ... 
## Addtional sampling using down-sampling
## 
## Resampling results across tuning parameters:
## 
##   size  decay  ROC  Sens       Spec     
##   1     0e+00  NaN  0.6678656  0.6661030
##   1     1e-04  NaN  0.6657312  0.6926270
##   1     1e-01  NaN  0.7137945  0.6823585
##   3     0e+00  NaN  0.6244269  0.6978447
##   3     1e-04  NaN  0.6408300  0.6695718
##   3     1e-01  NaN  0.6767589  0.6727431
##   5     0e+00  NaN  0.6554941  0.6762554
##   5     1e-04  NaN  0.6499605  0.6538824
##   5     1e-01  NaN  0.6807905  0.6580552
## 
## Sens was used to select the optimal model using the largest value.
## The final values used for the model were size = 1 and decay = 0.1.
6) Prediction
nn.pred.b <- predict(mod_nn_fitbalance, newdata = TestData)

7) Diagnosis (balance)

M6: XGBoost

Steps for the application of the algorithm:

Data set        Steps
Unbalanced data 1) Fit the model.
                2) Plot.
                3) Predict.
                4) Confusion matrix.
Balanced data   5) Fit the model.
                6) Predict.
                7) Confusion matrix.

Unbalanced data

1) Fitting the model
library(dplyr)
######################### transform data ############
data_xgboost <- purrr::map_df(data_scale, function(columna) {
                          columna %>% 
                          as.factor() %>% 
                          as.numeric %>% 
                          { . - 1 } })

test_xgboost <- sample_frac(data_xgboost, size = 0.249)
train_xgboost <- setdiff(data_xgboost, test_xgboost)


# Convert to DMatrix

train_xgb_matrix <-   train_xgboost %>% 
                            dplyr::select(- RESPONSE) %>% 
                            as.matrix() %>% 
                            xgboost::xgb.DMatrix(data = ., label = train_xgboost$RESPONSE)
# Convert to DMatrix

test_xgb_matrix <-  test_xgboost %>% 
                            dplyr::select(- RESPONSE) %>% 
                            as.matrix() %>% 
                            xgboost::xgb.DMatrix(data = ., label = test_xgboost$RESPONSE)
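
The transformation above recodes every column to 0-based numeric codes, since xgb.DMatrix requires a fully numeric matrix. The recoding idea on a single toy vector (a base-R illustration, not the project data):

```r
x <- c("low", "high", "low", "high")
# Factor levels sort alphabetically ("high", "low"), so "high" -> 0 and "low" -> 1
codes <- as.numeric(as.factor(x)) - 1
codes  # -> 1 0 1 0
```

One side effect worth keeping in mind: applied to a genuinely numeric column, this replaces each value by the rank of its distinct value (minus one) rather than keeping the value itself.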

#Same division
set.seed(1234)

#########################model######################################
train_params <- caret::trainControl(method = "repeatedcv", 
                             number = 10, # with n folds 
                             repeats=5) #K-Fold Cross Validation

mod_xgb_fit <- caret::train(RESPONSE ~ ., TrainData, 
                           method="xgbTree", 
                           trControl= train_params)
## eXtreme Gradient Boosting 
## 
## 750 samples
##  14 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 675, 676, 675, 674, 674, 675, ... 
## Resampling results across tuning parameters:
## 
##   eta  max_depth  colsample_bytree  subsample  nrounds  Accuracy   Kappa    
##   0.3  1          0.6               0.50        50      0.7482840  0.3272571
##   0.3  1          0.6               0.50       100      0.7461437  0.3454669
##   0.3  1          0.6               0.50       150      0.7483235  0.3577730
##   0.3  1          0.6               0.75        50      0.7400344  0.2980812
##   0.3  1          0.6               0.75       100      0.7528570  0.3554454
##   0.3  1          0.6               0.75       150      0.7521142  0.3627987
##   0.3  1          0.6               1.00        50      0.7325740  0.2539676
##   0.3  1          0.6               1.00       100      0.7434797  0.3159928
##   0.3  1          0.6               1.00       150      0.7538954  0.3571776
##   0.3  1          0.8               0.50        50      0.7395300  0.3093557
##   0.3  1          0.8               0.50       100      0.7496609  0.3575119
##   0.3  1          0.8               0.50       150      0.7510512  0.3668341
##   0.3  1          0.8               0.75        50      0.7442590  0.3085861
##   0.3  1          0.8               0.75       100      0.7510045  0.3545181
##   0.3  1          0.8               0.75       150      0.7502437  0.3581038
##   0.3  1          0.8               1.00        50      0.7312267  0.2485682
##   0.3  1          0.8               1.00       100      0.7480421  0.3296624
##   0.3  1          0.8               1.00       150      0.7528463  0.3555328
##   0.3  2          0.6               0.50        50      0.7466486  0.3464760
##   0.3  2          0.6               0.50       100      0.7501551  0.3703456
##   0.3  2          0.6               0.50       150      0.7507061  0.3724919
##   0.3  2          0.6               0.75        50      0.7528046  0.3659881
##   0.3  2          0.6               0.75       100      0.7582201  0.3889267
##   0.3  2          0.6               0.75       150      0.7576508  0.3957528
##   0.3  2          0.6               1.00        50      0.7574088  0.3681306
##   0.3  2          0.6               1.00       100      0.7587172  0.3889253
##   0.3  2          0.6               1.00       150      0.7576754  0.3886079
##   0.3  2          0.8               0.50        50      0.7488500  0.3582339
##   0.3  2          0.8               0.50       100      0.7525231  0.3810771
##   0.3  2          0.8               0.50       150      0.7552828  0.3932617
##   0.3  2          0.8               0.75        50      0.7560542  0.3710842
##   0.3  2          0.8               0.75       100      0.7550308  0.3853361
##   0.3  2          0.8               0.75       150      0.7529002  0.3805932
##   0.3  2          0.8               1.00        50      0.7531167  0.3540959
##   0.3  2          0.8               1.00       100      0.7531418  0.3727792
##   0.3  2          0.8               1.00       150      0.7534268  0.3788800
##   0.3  3          0.6               0.50        50      0.7486047  0.3617022
##   0.3  3          0.6               0.50       100      0.7379902  0.3453301
##   0.3  3          0.6               0.50       150      0.7390254  0.3507596
##   0.3  3          0.6               0.75        50      0.7483622  0.3622465
##   0.3  3          0.6               0.75       100      0.7453402  0.3627887
##   0.3  3          0.6               0.75       150      0.7418983  0.3568721
##   0.3  3          0.6               1.00        50      0.7534335  0.3741932
##   0.3  3          0.6               1.00       100      0.7547320  0.3817087
##   0.3  3          0.6               1.00       150      0.7538785  0.3830326
##   0.3  3          0.8               0.50        50      0.7490844  0.3705713
##   0.3  3          0.8               0.50       100      0.7411372  0.3605892
##   0.3  3          0.8               0.50       150      0.7384817  0.3558424
##   0.3  3          0.8               0.75        50      0.7538962  0.3792677
##   0.3  3          0.8               0.75       100      0.7451059  0.3633473
##   0.3  3          0.8               0.75       150      0.7362908  0.3459723
##   0.3  3          0.8               1.00        50      0.7491166  0.3575555
##   0.3  3          0.8               1.00       100      0.7472608  0.3650374
##   0.3  3          0.8               1.00       150      0.7445621  0.3566864
##   0.4  1          0.6               0.50        50      0.7491444  0.3446642
##   0.4  1          0.6               0.50       100      0.7494396  0.3587924
##   0.4  1          0.6               0.50       150      0.7513245  0.3705925
##   0.4  1          0.6               0.75        50      0.7472390  0.3297919
##   0.4  1          0.6               0.75       100      0.7528820  0.3637588
##   0.4  1          0.6               0.75       150      0.7545074  0.3749976
##   0.4  1          0.6               1.00        50      0.7437505  0.3017804
##   0.4  1          0.6               1.00       100      0.7512570  0.3505309
##   0.4  1          0.6               1.00       150      0.7539559  0.3619362
##   0.4  1          0.8               0.50        50      0.7413290  0.3250351
##   0.4  1          0.8               0.50       100      0.7430075  0.3447676
##   0.4  1          0.8               0.50       150      0.7531419  0.3755523
##   0.4  1          0.8               0.75        50      0.7496670  0.3379619
##   0.4  1          0.8               0.75       100      0.7502046  0.3581108
##   0.4  1          0.8               0.75       150      0.7486436  0.3629930
##   0.4  1          0.8               1.00        50      0.7368238  0.2829062
##   0.4  1          0.8               1.00       100      0.7517938  0.3506239
##   0.4  1          0.8               1.00       150      0.7544964  0.3647194
##   0.4  2          0.6               0.50        50      0.7421827  0.3465905
##   0.4  2          0.6               0.50       100      0.7430006  0.3569356
##   0.4  2          0.6               0.50       150      0.7384670  0.3501304
##   0.4  2          0.6               0.75        50      0.7498851  0.3680255
##   0.4  2          0.6               0.75       100      0.7493772  0.3738035
##   0.4  2          0.6               0.75       150      0.7483385  0.3754208
##   0.4  2          0.6               1.00        50      0.7589952  0.3804304
##   0.4  2          0.6               1.00       100      0.7603501  0.3944952
##   0.4  2          0.6               1.00       150      0.7581704  0.3941632
##   0.4  2          0.8               0.50        50      0.7523062  0.3697734
##   0.4  2          0.8               0.50       100      0.7469514  0.3680789
##   0.4  2          0.8               0.50       150      0.7411589  0.3562793
##   0.4  2          0.8               0.75        50      0.7572997  0.3858735
##   0.4  2          0.8               0.75       100      0.7506957  0.3769069
##   0.4  2          0.8               0.75       150      0.7506709  0.3799382
##   0.4  2          0.8               1.00        50      0.7547385  0.3686234
##   0.4  2          0.8               1.00       100      0.7566055  0.3849303
##   0.4  2          0.8               1.00       150      0.7530853  0.3822790
##   0.4  3          0.6               0.50        50      0.7469443  0.3697572
##   0.4  3          0.6               0.50       100      0.7317782  0.3378332
##   0.4  3          0.6               0.50       150      0.7309212  0.3396622
##   0.4  3          0.6               0.75        50      0.7472752  0.3662405
##   0.4  3          0.6               0.75       100      0.7446365  0.3656576
##   0.4  3          0.6               0.75       150      0.7384600  0.3528925
##   0.4  3          0.6               1.00        50      0.7482884  0.3607007
##   0.4  3          0.6               1.00       100      0.7471789  0.3670658
##   0.4  3          0.6               1.00       150      0.7415641  0.3575756
##   0.4  3          0.8               0.50        50      0.7462399  0.3662525
##   0.4  3          0.8               0.50       100      0.7382147  0.3529408
##   0.4  3          0.8               0.50       150      0.7315431  0.3372657
##   0.4  3          0.8               0.75        50      0.7589666  0.3965314
##   0.4  3          0.8               0.75       100      0.7419266  0.3589729
##   0.4  3          0.8               0.75       150      0.7331081  0.3405444
##   0.4  3          0.8               1.00        50      0.7464645  0.3570423
##   0.4  3          0.8               1.00       100      0.7455861  0.3627928
##   0.4  3          0.8               1.00       150      0.7442879  0.3619007
## 
## Tuning parameter 'gamma' was held constant at a value of 0
## Tuning
##  parameter 'min_child_weight' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 100, max_depth = 2, eta
##  = 0.4, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1 and subsample
##  = 1.
2) Plot
##    nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 80     100         2 0.4     0              0.6                1         1
3) Prediction (unbalance)
xgb.pred <- predict(mod_xgb_fit, newdata = TestData)  

4) Diagnosis (Unbalance)

Balanced data

5) Fitting the model: balance
train_params <- caret::trainControl(method = "repeatedcv", number = 10, 
                                    repeats=5, sampling = "down", 
                                    summaryFunction = twoClassSummary)

mod_xgb_fitbalance <- caret::train(RESPONSE ~ ., TrainData, method="xgbTree", 
                            metric = "Sens", #optimize sensitivity
                           maximize = TRUE,
                           trControl= train_params)
## eXtreme Gradient Boosting 
## 
## 750 samples
##  14 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 676, 675, 676, 675, 675, 674, ... 
## Addtional sampling using down-sampling
## 
## Resampling results across tuning parameters:
## 
##   eta  max_depth  colsample_bytree  subsample  nrounds  ROC  Sens     
##   0.3  1          0.6               0.50        50      NaN  0.7436364
##   0.3  1          0.6               0.50       100      NaN  0.7223715
##   0.3  1          0.6               0.50       150      NaN  0.7160474
##   0.3  1          0.6               0.75        50      NaN  0.7443478
##   0.3  1          0.6               0.75       100      NaN  0.7346640
##   0.3  1          0.6               0.75       150      NaN  0.7231225
##   0.3  1          0.6               1.00        50      NaN  0.7586166
##   0.3  1          0.6               1.00       100      NaN  0.7514625
##   0.3  1          0.6               1.00       150      NaN  0.7373518
##   0.3  1          0.8               0.50        50      NaN  0.7238340
##   0.3  1          0.8               0.50       100      NaN  0.7204743
##   0.3  1          0.8               0.50       150      NaN  0.7081028
##   0.3  1          0.8               0.75        50      NaN  0.7375099
##   0.3  1          0.8               0.75       100      NaN  0.7303162
##   0.3  1          0.8               0.75       150      NaN  0.7170751
##   0.3  1          0.8               1.00        50      NaN  0.7455336
##   0.3  1          0.8               1.00       100      NaN  0.7375494
##   0.3  1          0.8               1.00       150      NaN  0.7260870
##   0.3  2          0.6               0.50        50      NaN  0.7149407
##   0.3  2          0.6               0.50       100      NaN  0.6982609
##   0.3  2          0.6               0.50       150      NaN  0.7052964
##   0.3  2          0.6               0.75        50      NaN  0.7252174
##   0.3  2          0.6               0.75       100      NaN  0.7153755
##   0.3  2          0.6               0.75       150      NaN  0.7029249
##   0.3  2          0.6               1.00        50      NaN  0.7319763
##   0.3  2          0.6               1.00       100      NaN  0.7150198
##   0.3  2          0.6               1.00       150      NaN  0.7026877
##   0.3  2          0.8               0.50        50      NaN  0.7152174
##   0.3  2          0.8               0.50       100      NaN  0.7042688
##   0.3  2          0.8               0.50       150      NaN  0.7025692
##   0.3  2          0.8               0.75        50      NaN  0.7320158
##   0.3  2          0.8               0.75       100      NaN  0.7124901
##   0.3  2          0.8               0.75       150      NaN  0.6990909
##   0.3  2          0.8               1.00        50      NaN  0.7340316
##   0.3  2          0.8               1.00       100      NaN  0.7305929
##   0.3  2          0.8               1.00       150      NaN  0.7359684
##   0.3  3          0.6               0.50        50      NaN  0.6966798
##   0.3  3          0.6               0.50       100      NaN  0.7009881
##   0.3  3          0.6               0.50       150      NaN  0.6938340
##   0.3  3          0.6               0.75        50      NaN  0.7028458
##   0.3  3          0.6               0.75       100      NaN  0.6962846
##   0.3  3          0.6               0.75       150      NaN  0.6982609
##   0.3  3          0.6               1.00        50      NaN  0.7152174
##   0.3  3          0.6               1.00       100      NaN  0.7018182
##   0.3  3          0.6               1.00       150      NaN  0.6860079
##   0.3  3          0.8               0.50        50      NaN  0.6957708
##   0.3  3          0.8               0.50       100      NaN  0.6852964
##   0.3  3          0.8               0.50       150      NaN  0.6703557
##   0.3  3          0.8               0.75        50      NaN  0.7159684
##   0.3  3          0.8               0.75       100      NaN  0.7069565
##   0.3  3          0.8               0.75       150      NaN  0.6882609
##   0.3  3          0.8               1.00        50      NaN  0.7241502
##   0.3  3          0.8               1.00       100      NaN  0.7074308
##   0.3  3          0.8               1.00       150      NaN  0.7029644
##   0.4  1          0.6               0.50        50      NaN  0.7260870
##   0.4  1          0.6               0.50       100      NaN  0.7222134
##   0.4  1          0.6               0.50       150      NaN  0.7167194
##   0.4  1          0.6               0.75        50      NaN  0.7445059
##   0.4  1          0.6               0.75       100      NaN  0.7294071
##   0.4  1          0.6               0.75       150      NaN  0.7283794
##   0.4  1          0.6               1.00        50      NaN  0.7215020
##   0.4  1          0.6               1.00       100      NaN  0.7223320
##   0.4  1          0.6               1.00       150      NaN  0.7171146
##   0.4  1          0.8               0.50        50      NaN  0.7353360
##   0.4  1          0.8               0.50       100      NaN  0.7089723
##   0.4  1          0.8               0.50       150      NaN  0.7072727
##   0.4  1          0.8               0.75        50      NaN  0.7292490
##   0.4  1          0.8               0.75       100      NaN  0.7176285
##   0.4  1          0.8               0.75       150      NaN  0.7205929
##   0.4  1          0.8               1.00        50      NaN  0.7427273
##   0.4  1          0.8               1.00       100      NaN  0.7516996
##   0.4  1          0.8               1.00       150      NaN  0.7411858
##   0.4  2          0.6               0.50        50      NaN  0.7095257
##   0.4  2          0.6               0.50       100      NaN  0.6920553
##   0.4  2          0.6               0.50       150      NaN  0.6762055
##   0.4  2          0.6               0.75        50      NaN  0.7312648
##   0.4  2          0.6               0.75       100      NaN  0.7251779
##   0.4  2          0.6               0.75       150      NaN  0.7171542
##   0.4  2          0.6               1.00        50      NaN  0.7339130
##   0.4  2          0.6               1.00       100      NaN  0.7249802
##   0.4  2          0.6               1.00       150      NaN  0.7284585
##   0.4  2          0.8               0.50        50      NaN  0.6974704
##   0.4  2          0.8               0.50       100      NaN  0.6929249
##   0.4  2          0.8               0.50       150      NaN  0.6857708
##   0.4  2          0.8               0.75        50      NaN  0.7117391
##   0.4  2          0.8               0.75       100      NaN  0.6999209
##   0.4  2          0.8               0.75       150      NaN  0.7115415
##   0.4  2          0.8               1.00        50      NaN  0.7249802
##   0.4  2          0.8               1.00       100      NaN  0.7187747
##   0.4  2          0.8               1.00       150      NaN  0.7109486
##   0.4  3          0.6               0.50        50      NaN  0.7022925
##   0.4  3          0.6               0.50       100      NaN  0.6900791
##   0.4  3          0.6               0.50       150      NaN  0.6868379
##   0.4  3          0.6               0.75        50      NaN  0.6967194
##   0.4  3          0.6               0.75       100      NaN  0.6781818
##   0.4  3          0.6               0.75       150      NaN  0.6809091
##   0.4  3          0.6               1.00        50      NaN  0.7107510
##   0.4  3          0.6               1.00       100      NaN  0.6996047
##   0.4  3          0.6               1.00       150      NaN  0.6885771
##   0.4  3          0.8               0.50        50      NaN  0.6883399
##   0.4  3          0.8               0.50       100      NaN  0.6856522
##   0.4  3          0.8               0.50       150      NaN  0.6796443
##   0.4  3          0.8               0.75        50      NaN  0.7054150
##   0.4  3          0.8               0.75       100      NaN  0.6822530
##   0.4  3          0.8               0.75       150      NaN  0.6715415
##   0.4  3          0.8               1.00        50      NaN  0.6976680
##   0.4  3          0.8               1.00       100      NaN  0.6903953
##   0.4  3          0.8               1.00       150      NaN  0.6778656
##   Spec     
##   0.6956821
##   0.7025036
##   0.7071626
##   0.6766401
##   0.6957402
##   0.7014731
##   0.6675327
##   0.6979753
##   0.7029463
##   0.6873149
##   0.6998766
##   0.7003338
##   0.6827358
##   0.7006604
##   0.7036792
##   0.6743324
##   0.7014078
##   0.7032438
##   0.6949492
##   0.6934180
##   0.6933962
##   0.7059869
##   0.7114006
##   0.7086938
##   0.6781350
##   0.6938534
##   0.6999782
##   0.6903556
##   0.7002612
##   0.7003338
##   0.6869521
##   0.6919594
##   0.6889550
##   0.6840058
##   0.6900798
##   0.6957547
##   0.6762482
##   0.6873367
##   0.6789695
##   0.6911393
##   0.7009869
##   0.7039913
##   0.6857837
##   0.6953411
##   0.6965167
##   0.6789260
##   0.6809144
##   0.6820247
##   0.6790058
##   0.6801597
##   0.6812192
##   0.6908128
##   0.6965457
##   0.6877576
##   0.7011030
##   0.7110087
##   0.7139332
##   0.6908273
##   0.6885269
##   0.7041219
##   0.6835051
##   0.7003048
##   0.7052322
##   0.6926633
##   0.6952540
##   0.6964296
##   0.6835051
##   0.6907184
##   0.6949274
##   0.6781640
##   0.6926560
##   0.7013861
##   0.6827358
##   0.6870102
##   0.6880987
##   0.7041292
##   0.7030406
##   0.7034688
##   0.7037591
##   0.6984978
##   0.7042090
##   0.6865747
##   0.6804499
##   0.6846880
##   0.6937808
##   0.6983599
##   0.6998984
##   0.6968142
##   0.7011176
##   0.7023077
##   0.6896009
##   0.6686212
##   0.6690421
##   0.6888752
##   0.6821045
##   0.6778302
##   0.6923149
##   0.6946807
##   0.6908636
##   0.6765602
##   0.6820464
##   0.6804644
##   0.6845791
##   0.6830552
##   0.6837954
##   0.6908636
##   0.6927213
##   0.6915530
## 
## Tuning parameter 'gamma' was held constant at a value of 0
## Tuning parameter 'min_child_weight' was held constant at a value of 1
## Sens was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 50, max_depth = 1,
##   eta = 0.3, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1
##   and subsample = 1.
6) Prediction
xgb.pred.b <- predict(mod_xgb_fitbalance, newdata = TestData) 

7) Diagnosis (balance)

Assess model

For this section we will measure the main parameters of the six analyzed models.

Table 2: Summary table for assessing the models

                      Sensitivity Specificity  Accuracy
logistic                0.5200000   0.8965517 0.7831325
logistic_balance        0.7200000   0.6436782 0.6666667
decision_tree           0.3066667   0.9252874 0.7389558
decision_tree_balance   0.7333333   0.5517241 0.6064257
lda                     0.5200000   0.9022989 0.7871486
lda_balance             0.7333333   0.6954023 0.7068273
qda                     0.5466667   0.8390805 0.7510040
qda_balance             0.6533333   0.7183908 0.6987952
fda                     0.5466667   0.8505747 0.7590361
fda_balance             0.8533333   0.4885057 0.5983936
mda                     0.5466667   0.8505747 0.7590361
mda_balance             0.6933333   0.7126437 0.7068273
rf                      0.4666667   0.8735632 0.7510040
rf_balance              0.7600000   0.6379310 0.6746988
nn                      0.5333333   0.8793103 0.7751004
nn_balance              0.7733333   0.6149425 0.6626506
xgb                     0.5333333   0.8275862 0.7389558
xgb_balance             0.7466667   0.6149425 0.6546185

Evaluation

Evaluation of results

In this chapter we will assess the degree to which the chosen model meets the business objectives, and we will try to determine whether there is some business reason why this model is deficient.

The process will be to compare the results with the evaluation criteria we determined in chapter 3, business understanding.

The business goal of this analysis was to determine whether a client was at risk of not being able to pay back the credit that has been granted to them, as it would mean a loss for the company and the shareholders.

We will do so by assuming that the company grants a credit only to applicants with a good credit score, i.e. those with a positive response variable, and refuses those with a response of zero.

To do so, we will look at the number of false positives generated by each model, as these are the people who would not be able to pay the company back. We will then estimate the potential losses the company could incur by using each model; to meet our criterion, these losses should be lower than 10% of the total amount of credit the company would be willing to grant.

We will look specifically at the balanced versions of the neural network, the random forest and the XGBoost, as these showed the best overall performance on the three parameters we track: specificity, sensitivity and accuracy.

# Count the false positives (predicted 1, actual 0) for each balanced model
RF <- confusionMatrix(as.factor(rf.pred.b), as.factor(TestData$RESPONSE))$table[2,1]

NN <- confusionMatrix(as.factor(nn.pred.b), as.factor(TestData$RESPONSE))$table[2,1]

XGB <- confusionMatrix(as.factor(xgb.pred.b), as.factor(TestData$RESPONSE))$table[2,1]

FP <- data.frame(t(data.frame(RF, NN, XGB)))
names(FP) <- c("False Positive")
FP
##     False Positive
## RF              18
## NN              17
## XGB             19

The table shows the number of false positives in the predictions of each model. The lowest value belongs to the neural network and is equal to 17. This means that in at least 17 cases the model would falsely predict that a person belongs to the category that should be granted a credit when they should not be. These cases are risky for the company, as they could result in a default on the repayment of the credit and hence in a loss for the company.

However, the models are still quite satisfactory, as the false positives represent only a small percentage of the observations in the test set; the values are shown in the following table.

FP %<>% dplyr::mutate(Model = c("RF", "NN", "XGB"), 
                     FP_Perc = (FP[,1]/nrow(TestData))) %>% dplyr::select("Model", everything())
FP
##   Model False Positive    FP_Perc
## 1    RF             18 0.07228916
## 2    NN             17 0.06827309
## 3   XGB             19 0.07630522

We can see that the three chosen models all have a false-positive percentage lower than 10%. However, the test set is quite small, so we should repeat the testing with more data to make sure the rates remain this low.
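To get a rough sense of this uncertainty, a binomial confidence interval can be computed for the observed false-positive rate; a minimal sketch in base R, using the neural network's 17 false positives out of the 249 test observations reported above:

```r
# 95% confidence interval for the NN false-positive rate (17 / 249)
prop.test(x = 17, n = 249)$conf.int
```

The interval spans several percentage points around the observed 6.8%, which supports the point that more test data would be needed before trusting these rates.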

We can calculate the maximum losses that would occur if none of the people in the false-positive group actually paid back the credit they were granted.

amount <- data_sel[-val_index,]$AMOUNT

fp.rf <- (ifelse(rf.pred.b == 1 & TestData$RESPONSE == 0, 1, 0))
losses.rf <- sum(fp.rf * amount)

fp.nn <- (ifelse(nn.pred.b == 1 & TestData$RESPONSE == 0, 1, 0))
losses.nn <- sum(fp.nn * amount)

fp.xgb <- (ifelse(xgb.pred.b == 1 & TestData$RESPONSE == 0, 1, 0))
losses.xgb <- sum(fp.xgb * amount)

Losses <- data.frame(losses.rf, losses.nn, losses.xgb)
Losses <- data.frame(t(Losses))
names(Losses) <- "Losses"
Losses %<>% dplyr::mutate(Model = c("RF", "NN", "XGB")) %>% dplyr::select(Model, Losses)
Losses
##   Model Losses
## 1    RF  72166
## 2    NN  56671
## 3   XGB  66966

As we can see, the amounts range from 56,671 to 72,166. Surprisingly, the random forest performs better at limiting false positives (its percentage is lower than the XGBoost's, for example), yet it generates higher losses. This suggests that the XGBoost gives more importance to the amount variable when predicting the category of a new person, thereby keeping the losses as low as possible. On this criterion it should be preferred to the random forest.

We want to determine whether these losses represent a high percentage of the total amount of credit that would be granted to the people belonging to the test set.

sel <- data_sel[-val_index,] #getting the observations unscaled 
pos <- sel %>% dplyr::filter(RESPONSE == 1) %>% dplyr::select(AMOUNT) #selecting only the amount of the credits that are granted

Losses %<>% dplyr::mutate(Losses_Perc = Losses / sum(pos))
Losses
##   Model Losses Losses_Perc
## 1    RF  72166   0.1344427
## 2    NN  56671   0.1055760
## 3   XGB  66966   0.1247553

As we can see, the model with the lowest percentage is the neural network, at roughly 10.6%. It comes closest to our selection criterion of keeping the losses below 10% of the total amount of the credits that would be granted, although it still slightly exceeds it.

We can also note that the loss percentages of the random forest and the XGBoost exceed the threshold by only a few percentage points, so using one of these models could still be considered if it meant a lower cost for the company in terms of complexity and computation time. This applies more to the random forest than to the XGBoost, as the latter took quite some time to fit. Moreover, the random forest allows a higher degree of interpretation, while the neural network is used more as a black box.

cbind(FP, Losses[,-1])
##   Model False Positive    FP_Perc Losses Losses_Perc
## 1    RF             18 0.07228916  72166   0.1344427
## 2    NN             17 0.06827309  56671   0.1055760
## 3   XGB             19 0.07630522  66966   0.1247553

We would hence suggest using a random forest model, as it has among the highest sensitivities, one of the lowest numbers of false-positive predictions and a loss percentage of about 13.4%, while also offering a higher degree of interpretability and lower complexity compared to the other methods selected at the end of our modelling chapter.

Moreover, we have seen that not all the variables included in the dataset are actually useful for predicting the response. This means that the company, when evaluating a new customer, should focus on collecting information only for the selected variables, namely CHK_ACCT, DURATION, HISTORY, PURPOSE, AMOUNT, SAV_ACCT, EMPLOYMENT, INSTALL_RATE, MALE_SINGLE, GUARANTOR, PROPERTY, OTHER_INSTALL, RESIDENCE, NUM_CREDITS, TELEPHONE. This would mean lower costs for the company, as it would spend less time gathering useless information and need less space to store it.
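As a small illustration of this reduced data collection (a sketch, assuming the `data` object loaded from GermanCredit.csv at the beginning of the report; the exact column names after the recoding done in the EDA may differ):

```r
# Keep only the 15 selected predictors, plus the response
selected_vars <- c("CHK_ACCT", "DURATION", "HISTORY", "PURPOSE", "AMOUNT",
                   "SAV_ACCT", "EMPLOYMENT", "INSTALL_RATE", "MALE_SINGLE",
                   "GUARANTOR", "PROPERTY", "OTHER_INSTALL", "RESIDENCE",
                   "NUM_CREDITS", "TELEPHONE")
data_reduced <- data[, c(selected_vars, "RESPONSE")]
```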

Review the process

Overview

We started our data mining with an exploratory data analysis. We looked at the structure of the dataset, which had 32 variables and 1’000 observations. We then had a more detailed look at the output variable and concluded that it was binary with a majority of positive instances. Looking at the independent variables, we could see that the continuous ones were skewed and had different scales. We also identified and fixed some errors in the data, while no missing values were found.

In the second part of our EDA we built a few categorical variables to reduce the number of variables needed in the modelling: more specifically, a binary variable describing the sex of the person, one categorical variable for the purpose of the credit, one for the property and another for the residence. To assess whether it made sense to aggregate the variables, some chi-squared tests were run. To further select the data, we fitted a simple linear regression and used the AIC to keep only the most significant variables.

We were then able to move on to the modelling part, in which we used six different models, namely: logistic regression, decision trees, discriminant analysis, random forest, neural network and XGBoost. For each of them we fitted a model on the unbalanced training set (containing 75% of the data, randomly selected) and compared its predictions to the test set (containing the remaining 25%). We also balanced the dataset, so as to have roughly the same number of positive and negative values of the response, fitted the same models on a training set built in the same way from this data, and compared the predictions to the test set.

Improvements

We believe that what has been done was an accurate analysis of the data; however, some improvements could be made in terms of process performance. More specifically, the variable created to describe the sex of the person was not selected, hence it was not necessary to create it. Moreover, the correlations were calculated but not really used for the selection of the variables, so they could have been avoided too. We also believe the coding could have been executed more efficiently, as a lot of repetition occurred, specifically in the modelling part. We could have created a function for the modelling and used it to reduce the lines of code, or found another way to optimize it, e.g. by using a different library. However, thanks to the caret package, we were already able to optimize a good part of the code, which would otherwise have been even longer and more complicated. Furthermore, we could have included different models, as some of the ones we used are elementary and were expected to perform poorly compared to more complex ones such as the neural network or the random forest. We could have chosen one simple model as a baseline, checked whether the gains in accuracy, sensitivity and specificity were high enough, and then kept only the best-performing models while trying some others.

In any case, the results we have found are quite satisfying, as we could still find three models whose predictions meet (or almost meet) our business success criteria.

Next Steps

To improve the process, another model could be selected, perhaps one that was not considered in our analysis. However, we believe the results already obtained are satisfying enough.

Another way to improve the model could be to consider information that was not part of our analysis, such as the number of other pending credits or the history of (un)repaid credits.

An alternative way could be to gather other information from other credit companies, banks, insurances, etc., so that it is possible to fit a more powerful model.

Decision

With our analysis, the company should be able to assess the quality of a new customer and predict whether it is a good idea to grant them a credit. We believe the company should follow these steps each time a new customer approaches the firm from now on:

1. Collect information only for the selected variables, namely CHK_ACCT, DURATION, HISTORY, PURPOSE, AMOUNT, SAV_ACCT, EMPLOYMENT, INSTALL_RATE, MALE_SINGLE, GUARANTOR, PROPERTY, OTHER_INSTALL, RESIDENCE, NUM_CREDITS, TELEPHONE;
2. With the information gathered, run a random forest prediction and determine whether the credit should be granted;
3. Store the result of the decision;
4. In case the credit was given, wait and see if it is paid back;
5. Store the result of the debt settlement;
6. Use the new data to fit an upgraded model.
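The first three steps could be wrapped in a small helper; a minimal sketch, assuming the fitted balanced random forest is stored in an object called `mod_rf_fitbalance` (the actual object name used in the modelling chapter may differ) and that `new_applicant` is a one-row data frame containing the selected variables:

```r
score_applicant <- function(new_applicant, model = mod_rf_fitbalance) {
  # Step 2: predict the class (1 = grant the credit, 0 = refuse it)
  decision <- predict(model, newdata = new_applicant)
  # Step 3: store the decision together with the inputs for later re-training
  cbind(new_applicant, DECISION = decision, DATE = Sys.Date())
}
```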

Conclusions

Conclusions for the business

Goal: losses < 10% of the total amount of credit granted

According to NASA Technical Reports, human error has been reported as responsible for 60%-80% of failures, which means that automating the selection process would reduce the error and the risk of failing to collect the interest generated by the loans. However, this tool must co-exist with experienced personnel, because some factors must still be checked by hand, for example the verification of the documents provided with the application. In addition, there would be an improvement in response times and a reduction in the staff's workload.

Conclusions for data mining

At the beginning of the project, we asked ourselves some questions that would help us meet our main objective; we now use them to draw some conclusions.

Are there any variables that could be grouped?

Yes, we tested the independence of some variables and grouped them, creating several dummy variables; for instance, the credit-purpose variable has six different levels.

Have we used all the original independent variables of the model?

No; in the end we selected the 15 variables that bring the most information to the model: CHK_ACCT, DURATION, HISTORY, PURPOSE, AMOUNT, SAV_ACCT, EMPLOYMENT, INSTALL_RATE, MALE_SINGLE, GUARANTOR, PROPERTY, OTHER_INSTALL, RESIDENCE, NUM_CREDITS, TELEPHONE.

Is the data balanced regarding the answer variable?

No, the data is not balanced. At the beginning of the modelling we detected a greater inclination towards predicting positives, which means the models were biased; to correct this, we changed the training parameters of each model so as to balance the data and maximize the sensitivity.
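One common way to do this balancing inside caret (a sketch; the exact balancing approach used in the modelling chapter may differ) is to up-sample the minority class within the resampling loop and select models on sensitivity:

```r
library(caret)

# Up-sample the minority class in each resampling iteration
ctrl_bal <- trainControl(method = "cv", number = 10,
                         sampling = "up",
                         classProbs = TRUE,
                         summaryFunction = twoClassSummary)
# Passing metric = "Sens" to train() then picks the candidate
# with the highest sensitivity, as in the XGBoost tuning output above
```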

Does it make sense to balance the data to avoid the model being biased?

Yes: the solutions trained on balanced data showed, in general, a much higher sensitivity, at the cost of some accuracy and specificity.

Accuracy, sensitivity or specificity: which do we need to focus on more?

In our case, we focus on maximizing the sensitivity. The positive prediction ratio is key because a false positive increases the risk of granting a credit to a client who cannot meet the payments, in which case the bank cannot collect the interest.
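These quantities can be read directly from a caret confusion matrix; a sketch, reusing the balanced neural-network predictions from the diagnosis section and assuming class "1" (good credit) is treated as the positive class:

```r
library(caret)

cm <- confusionMatrix(as.factor(nn.pred.b), as.factor(TestData$RESPONSE),
                      positive = "1")          # class 1 = good credit risk
cm$byClass[c("Sensitivity", "Specificity")]    # the rates reported in Table 2
cm$table                                       # false positives sit in cell [2, 1]
```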

Which model fits better?

As we mentioned earlier in the evaluation section, we decided to go for the random forest model. It is the model that best manages the trade-off between accuracy and sensitivity while keeping the percentage of losses at an acceptable level.

References

Kuhn, M. caret: Classification and Regression Training. R documentation: https://www.rdocumentation.org/packages/caret/versions/6.0-86

Xie, Y., Allaire, J. J., & Grolemund, G. (2020). R Markdown: The Definitive Guide. Retrieved from https://bookdown.org/yihui/rmarkdown/

RStudio. R Markdown lesson 7: Code Chunks. https://rmarkdown.rstudio.com/lesson-7.html